Introduction to SQL



  1. 1. FROM BUSINESS OBJECTIVES TO DATA MINING: TOWARDS A SYSTEMATIC WAY OF DATA MINING PROJECT DEVELOPMENT Ernestina Menasalvas, Facultad de Informática, Universidad Politécnica de Madrid, Spain [email_address] November 2004
  2. 2. Background(I) <ul><li>1995: doctoral student. </li></ul><ul><ul><li>Visit University of Regina (Prof. Ziarko) </li></ul></ul><ul><ul><li>Visit Warsaw University (Prof. Pawlak) </li></ul></ul><ul><li>1998: Defend thesis. Data Mining process model (Anita Wasilewska & C. Fernandez-Baizan) </li></ul><ul><li>Since then: </li></ul><ul><ul><li>Data Bases Professor: Data bases, data mining </li></ul></ul><ul><ul><li>Coordinator of the Data Mining group at Facultad de Informática UPM </li></ul></ul><ul><ul><ul><li>Techniques: Rough Sets, Bayes, … </li></ul></ul></ul><ul><ul><ul><li>Methodologies for data mining process management </li></ul></ul></ul><ul><ul><ul><ul><li>Evaluation in Data Mining </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Experimentation in Web Mining </li></ul></ul></ul></ul><ul><ul><ul><li>Web Mining: Web Goal Mining </li></ul></ul></ul>
  3. 3. Background(II) <ul><li>Projects developed: </li></ul><ul><ul><li>Pure Research: </li></ul></ul><ul><ul><ul><li>Data Mining to be integrated on RDBMS </li></ul></ul></ul><ul><ul><ul><li>Web Profiler </li></ul></ul></ul><ul><ul><ul><li>Methodology for Data Mining process management </li></ul></ul></ul><ul><ul><li>Research and application: </li></ul></ul><ul><ul><ul><li>Data Mining applied on different domains: </li></ul></ul></ul><ul><ul><ul><ul><li>Car dealers </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Travel agency </li></ul></ul></ul></ul><ul><ul><ul><ul><li>… . </li></ul></ul></ul></ul>
  4. 4. Data Mining Project Development <ul><li>Methodologies for Data Mining project development </li></ul><ul><ul><li>Is Data Mining really a science? </li></ul></ul><ul><ul><li>Are we developing projects as an art? </li></ul></ul><ul><ul><li>Has research achieved the same results in all areas? </li></ul></ul><ul><ul><ul><li>Algorithms </li></ul></ul></ul><ul><ul><ul><li>Data Preparation </li></ul></ul></ul><ul><ul><ul><li>Data enrichment </li></ul></ul></ul><ul><ul><ul><li>Conceptualization of Data Mining problems </li></ul></ul></ul>
  5. 5. Data Mining: an art, a science? <ul><li>Since data mining appeared, many algorithms have been implemented </li></ul><ul><li>Standards: </li></ul><ul><ul><li>CRISP-DM </li></ul></ul><ul><ul><li>SEMMA </li></ul></ul><ul><ul><li>PMML 3.0 </li></ul></ul><ul><li>The process depends on the expertise of the data miner </li></ul><ul><li>The user speaks about business problems </li></ul><ul><li>The data miner speaks about algorithms </li></ul>
  6. 6. Data Mining as a project <ul><li>Data Mining is a data-intensive activity </li></ul><ul><ul><li>Data understanding </li></ul></ul><ul><ul><li>Data Preparation </li></ul></ul><ul><li>Database manager: </li></ul><ul><ul><li>Transactional databases </li></ul></ul><ul><ul><li>Data warehouses </li></ul></ul><ul><li>The end result of a data mining project is a tool (a software project) for a better decision-making process: </li></ul><ul><ul><li>Software development project </li></ul></ul><ul><li>The IT department has to be involved </li></ul>
  7. 7. Project Management <ul><li>Why? </li></ul><ul><ul><li>In order to organize the development process and to produce a project plan </li></ul></ul><ul><li>How? </li></ul><ul><li>Establish how the process is going to be developed: </li></ul><ul><ul><li>Sequential </li></ul></ul><ul><ul><li>Incremental </li></ul></ul><ul><li>What? </li></ul><ul><li>Establish how the process is split into phases and define the tasks to be developed in each step: </li></ul><ul><ul><li>RUP </li></ul></ul><ul><ul><li>XP </li></ul></ul><ul><ul><li>CommonKADS </li></ul></ul>LIFECYCLE MODELS <ul><li>Way of making things </li></ul><ul><li>Independent of the process being developed </li></ul>METHODOLOGY <ul><li>Particular tasks </li></ul><ul><li>Detail of tasks to be developed </li></ul>
  8. 8. Common pitfalls of data mining implementation <ul><li>The common pitfalls of data mining implementation are the following: </li></ul><ul><ul><li>Not being able to efficiently communicate mining results within an organization. </li></ul></ul><ul><ul><li>Not having the right data to conduct effective analysis. </li></ul></ul><ul><ul><li>Not using existing data correctly. </li></ul></ul><ul><ul><li>Not being able to evaluate results </li></ul></ul><ul><li>Questions that arise: </li></ul><ul><ul><li>Can the adequacy of a data set for a problem be established when preparing the project plan? </li></ul></ul><ul><ul><li>How can the data set be used to produce the expected results? </li></ul></ul><ul><ul><li>How can we evaluate the results? </li></ul></ul><ul><ul><li>Cost estimation? </li></ul></ul>
  9. 9. Data Mining Approaches <ul><li>Vendor independent: </li></ul><ul><ul><li>CRISP-DM </li></ul></ul><ul><li>Based on the commercial tools: </li></ul><ul><ul><li>CATs </li></ul></ul><ul><ul><li>SEMMA </li></ul></ul><ul><li>CRM Methodology: </li></ul><ul><ul><li>CRM Catalyst </li></ul></ul>Model Process. Not Real Methodology. Based on CRISP-DM. Global CRM process. Does not concentrate on Data Mining step.
  10. 10. Cross-Industry Standard Process for Data Mining: CRISP-DM
  11. 11. Data Mining as a project: CATs <ul><li>CATs: Clementine Application Templates [CATs] </li></ul><ul><ul><li>Specific libraries of best practices that provide immediate value right out of the box </li></ul></ul><ul><ul><li>They follow the CRISP-DM standard: every CAT stream is assigned to a CRISP-DM phase </li></ul></ul><ul><ul><li>They provide long-term value, as they can always be reused with a new data set for new insight in other projects. </li></ul></ul><ul><li>Available as an add-on module to Clementine, they include: </li></ul><ul><ul><li>Telco CAT - improve retention and cross-selling efforts for telecommunications </li></ul></ul><ul><ul><li>CRM CAT - understand and predict customer migration between segments </li></ul></ul><ul><ul><li>Microarray CAT - accelerate biological discoveries, find genes </li></ul></ul><ul><ul><li>Fraud CAT - predict and detect instances of fraud in financial transactions, claims, tax returns … </li></ul></ul><ul><ul><li>Web CAT </li></ul></ul>
  12. 12. What is a CAT? [CATs]
  13. 13. SEMMA(1) <ul><li>SEMMA (Sample, Explore, Modify, Model, Assess): [SEMMA] </li></ul><ul><ul><li>It is not a data mining methodology </li></ul></ul><ul><ul><li>Rather, it is a logical organization of the functional tool set of SAS Enterprise Miner for carrying out the core tasks of data mining. </li></ul></ul><ul><ul><li>Enterprise Miner can be used as part of any iterative data mining methodology adopted by the client. </li></ul></ul><ul><ul><li>Naturally, steps such as formulating a well-defined business or research problem and assembling quality, representative data sources are critical to the overall success of any data mining project. </li></ul></ul>
  14. 14. SEMMA(2) <ul><li>SEMMA is focused on the model development aspects of data mining: [SEMMA] </li></ul><ul><ul><li>Sample the data to extract a portion of a large data set big enough to contain significant information, yet small enough to manipulate quickly. </li></ul></ul><ul><ul><li>Explore the data by searching for unanticipated trends and anomalies in order to gain understanding and ideas. </li></ul></ul><ul><ul><li>Modify the data by creating, selecting, and transforming the variables to focus the model selection problem. </li></ul></ul><ul><ul><li>Model the data by allowing the software to search automatically for a combination of data that reliably predicts a desired outcome. Modelling techniques include neural networks, tree classifiers, statistical models, etc. </li></ul></ul><ul><ul><li>Assess the data by evaluating the usefulness and reliability of the findings from the data mining process and estimating how well it performs. </li></ul></ul>
  15. 15. Methods for Project Management: CRM Catalyst(1) <ul><li>Developed jointly by CustomISe, MACS and SalesPathways. Together they have formed the Catalyst Foundation </li></ul><ul><li>Motivations: </li></ul><ul><li>CRM projects are difficult to execute successfully because of the wide range of factors influencing their success. So it can take a long time to make CRM work properly for an organisation. </li></ul><ul><li>Solution: CRM Catalyst. </li></ul><ul><li>Methodology acts as a catalyst for CRM projects enabling them to achieve their objectives more reliably and in less time. </li></ul><ul><li>It gives a project life cycle with a set of defined phases broken down into steps with clearly stated inputs and outputs. </li></ul>
  16. 16. Methods for Project Management: CRM Catalyst(2) Implementation requires a Data Mining development process. Implementation is knowledge intensive. The results are obtained in a progressive way. Progressive Lifecycle Model. In some steps a Knowledge Intensive Methodology could be appropriate.
  17. 17. Main steps in a Data Mining Project <ul><li>Define the goals: </li></ul><ul><ul><li>Business and data mining experts have to define the goals together </li></ul></ul><ul><ul><li>Each goal must be defined with measurements for success </li></ul></ul><ul><li>Obtain the models: </li></ul><ul><ul><li>Apply data mining algorithms. </li></ul></ul><ul><ul><li>Preprocessing is important </li></ul></ul><ul><li>Evaluate results: </li></ul><ul><ul><li>Ascertain the value of an object according to specified criteria, operationalised in terms of measures. </li></ul></ul><ul><li>Deploy: </li></ul><ul><ul><li>Decide which patterns and models can be deployed </li></ul></ul><ul><li>Evaluate after deployment: </li></ul><ul><ul><li>Once the product is in operation, its results should be checked </li></ul></ul>
  18. 18. 1. Define the goals <ul><li>Distinguish between: </li></ul><ul><ul><li>Data Mining goals </li></ul></ul><ul><ul><li>Business goals </li></ul></ul><ul><li>How do we translate? </li></ul>Classification? Estimation? Association? Business goal example: Increase the lifetime value of valuable customers. This has to be solved in the Business Understanding step of CRISP-DM.
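  19. 19. Business Understanding in the CRISP-DM Process <ul><li>Determine Business Objectives </li></ul><ul><ul><li>Background </li></ul></ul><ul><ul><li>Business Objectives </li></ul></ul><ul><ul><li>Business Success Criteria </li></ul></ul><ul><li>Assess Situation </li></ul><ul><ul><li>Inventory of Resources </li></ul></ul><ul><ul><li>Requirements, Assumptions & Constraints </li></ul></ul><ul><ul><li>Risks & Contingencies </li></ul></ul><ul><ul><li>Terminology </li></ul></ul><ul><ul><li>Costs & Benefits </li></ul></ul><ul><li>Determine Data Mining Goals </li></ul><ul><ul><li>Data Mining Goals </li></ul></ul><ul><ul><li>Data Mining Success Criteria </li></ul></ul><ul><li>Produce Project Plan </li></ul><ul><ul><li>Project Plan </li></ul></ul><ul><ul><li>Initial Assessment of Tools & Techniques </li></ul></ul>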
  20. 20. 1.1 Determine Business objectives and success criteria <ul><li>Not only business objectives have to be established, but also measures for evaluating the results </li></ul><ul><li>Business objectives: </li></ul><ul><ul><li>What is the customer's primary objective? </li></ul></ul><ul><ul><ul><li>Increase the number of loyal customers </li></ul></ul></ul><ul><ul><ul><li>Sell more of a certain product </li></ul></ul></ul><ul><ul><ul><li>Run a successful marketing campaign </li></ul></ul></ul><ul><li>Business success criteria: </li></ul><ul><ul><li>What constitutes a successful outcome of the project? </li></ul></ul><ul><ul><li>Objective measures so that success can be established </li></ul></ul><ul><ul><li>ROI </li></ul></ul>
  21. 21. 1.2 Costs & Benefits <ul><li>Perform a cost-benefit analysis </li></ul><ul><ul><ul><li>Compute the benefits of the project </li></ul></ul></ul><ul><ul><ul><ul><li>Which measures do we have? </li></ul></ul></ul></ul><ul><ul><ul><ul><li>ROI </li></ul></ul></ul></ul><ul><ul><ul><ul><li>CAPEX </li></ul></ul></ul></ul><ul><ul><ul><ul><li>OPEX... </li></ul></ul></ul></ul><ul><ul><ul><li>Compute the costs of the project (equipment, human resources...) </li></ul></ul></ul><ul><ul><ul><ul><li>Which methodology do we have? </li></ul></ul></ul></ul><ul><ul><ul><ul><li>COCOMO for software </li></ul></ul></ul></ul><ul><ul><ul><li>Quantify the risk that the project fails </li></ul></ul></ul><ul><ul><ul><ul><li>Knowledge not available </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Data not available </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Proper tools not available </li></ul></ul></ul></ul>
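As a rough illustration of the cost-benefit step, ROI can be computed as net benefit divided by cost, and the risk of failure can be folded in by weighting the benefit with a success probability. This is only a sketch: the function names, the figures, and the simple risk weighting are assumptions made for the example, not part of the slides.

```python
def roi(benefit: float, cost: float) -> float:
    """Return on investment as a fraction: (benefit - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (benefit - cost) / cost


def risk_adjusted_roi(benefit: float, cost: float, p_success: float) -> float:
    """ROI when the benefit is only obtained with probability p_success.

    Assumption for this sketch: a failed project yields no benefit at all.
    """
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be between 0 and 1")
    return roi(p_success * benefit, cost)


# Hypothetical figures, invented for the example.
expected_benefit = 150_000.0
total_cost = 100_000.0

print(f"ROI: {roi(expected_benefit, total_cost):.0%}")
print(f"Risk-adjusted ROI: {risk_adjusted_roi(expected_benefit, total_cost, 0.8):.0%}")
```

Comparing the two figures shows how much of the apparent return depends on the project actually succeeding.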
  22. 22. Data Mining Estimation Model <ul><li>Establishing a parametric estimation model for Data Mining (Marban’03) </li></ul>DMCOMO (Data Mining COst MOdel)
  23. 23. Data Mining Cost Estimation <ul><li>Main factors in a Data Mining project </li></ul><ul><ul><li>Data Sources (number, kind, nature, …) </li></ul></ul><ul><ul><li>Data mining problem to be solved (descriptive, predictive, …) </li></ul></ul><ul><ul><li>Development platform </li></ul></ul><ul><ul><li>Available tools </li></ul></ul><ul><ul><li>Expertise of the development team </li></ul></ul><ul><li>Drivers </li></ul><ul><ul><li>Data Drivers </li></ul></ul><ul><ul><li>Model Drivers </li></ul></ul><ul><ul><li>Platform Drivers </li></ul></ul><ul><ul><li>Tools and techniques Drivers </li></ul></ul><ul><ul><li>Project Drivers </li></ul></ul><ul><ul><li>People Drivers </li></ul></ul>
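To make the idea of cost drivers concrete, here is a minimal sketch of a generic parametric estimate in which each driver category contributes a multiplier. The slides do not give DMCOMO's equation, so the COCOMO-style multiplicative form, the coefficients `a` and `b`, the size unit, and every multiplier value below are assumptions chosen only for illustration.

```python
from math import prod


def estimate_effort(size: float, multipliers: dict, a: float = 1.0, b: float = 1.0) -> float:
    """Generic parametric form: effort = a * size**b * product of driver multipliers.

    This is NOT DMCOMO's calibrated model, only a COCOMO-style illustration.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return a * size ** b * prod(multipliers.values())


# One hypothetical multiplier per driver category named on the slide.
# Values above 1.0 increase the estimate, values below 1.0 reduce it.
drivers = {
    "data": 1.2,       # e.g. many heterogeneous data sources
    "model": 1.0,
    "platform": 0.9,
    "tools": 1.1,
    "project": 1.0,
    "people": 0.8,     # e.g. an experienced team
}

effort = estimate_effort(size=10.0, multipliers=drivers)
print(f"Estimated effort (arbitrary units): {effort:.3f}")
```

A real model would need its coefficients and multipliers calibrated against completed projects.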
  24. 24. 1.3 Data Mining goals and success <ul><li>Data mining goals: </li></ul><ul><ul><li>Translate the customer's primary objective into a data mining goal, e.g. </li></ul></ul><ul><ul><ul><li>Loyalty program translated into a segmentation problem </li></ul></ul></ul><ul><ul><ul><li>Decreasing the attrition rate transformed into a classification problem </li></ul></ul></ul><ul><li>Data mining success criteria: </li></ul><ul><ul><li>Determine success in technical terms </li></ul></ul><ul><ul><ul><li>Translate the notion of success into confidence, support, lift and other parameters </li></ul></ul></ul><ul><ul><ul><li>Determine the cost of errors </li></ul></ul></ul><ul><li>How do we make the translation? </li></ul>
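The technical success measures named above have standard definitions for an association rule A → C: support is the fraction of transactions containing both A and C, confidence is support(A and C) / support(A), and lift is confidence / support(C). A minimal sketch follows; the function name and the toy transactions are invented for the example.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)          # transactions containing A
    n_c = sum(1 for t in transactions if c <= t)          # transactions containing C
    n_ac = sum(1 for t in transactions if (a | c) <= t)   # transactions containing both
    if n == 0 or n_a == 0 or n_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_ac / n
    confidence = n_ac / n_a
    lift = confidence / (n_c / n)
    return support, confidence, lift


# Toy market-basket data, invented for the example.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"butter"},
]

support, confidence, lift = rule_metrics(baskets, {"bread"}, {"milk"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the antecedent makes the consequent more likely than its base rate, which is one way a success threshold could be stated in technical terms.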
  25. 25. Methodology <ul><li>Which methodology should be followed to translate business objectives into data mining objectives? </li></ul><ul><li>Unfortunately, there is no such methodology. First we have to solve: </li></ul><ul><ul><li>How is a business objective expressed? </li></ul></ul><ul><ul><li>What is a data mining goal? </li></ul></ul><ul><ul><li>How are data mining goals achieved? </li></ul></ul><ul><ul><li>What are the requirements of data mining functions? </li></ul></ul>In order to describe everything in a standard way: Conceptualize the problem
  26. 26. Conceptualization in other disciplines <ul><li>Databases: </li></ul><ul><ul><li>E/R diagrams </li></ul></ul><ul><ul><li>Independent of the domain </li></ul></ul><ul><ul><li>A tool for business understanding and for the database designer </li></ul></ul><ul><ul><li>Translation from E/R to implementation </li></ul></ul>Internal Schema, Conceptual Schema, External view 1 … External view n
  27. 27. Proposed 3-level architecture. Internal Schema, Conceptual Schema, Business problem … Business problem. Requirements of algorithms will be solved at this level. Tool requirements to be solved: SAS, WEKA, Clementine…
  28. 28. 3-layer architecture for data mining <ul><li>It is the bridge: </li></ul><ul><ul><li>Between business goals and the final tool </li></ul></ul><ul><ul><li>Independent of the domain </li></ul></ul><ul><li>Provides independence: </li></ul><ul><ul><li>Changes in the tool do not affect the solution </li></ul></ul><ul><li>It has to be decided what to model in the conceptualization </li></ul><ul><li>Automatic translation of business goals into data mining goals </li></ul><ul><li>Data Mining goals + constraints = feasible data mining goals </li></ul>
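The last bullet, goals plus constraints giving feasible goals, can be read as a filter over candidate data mining goals. The sketch below is one hypothetical way to express it: the `MiningGoal` fields, the two constraints checked (labelled data and a minimum number of rows), and the sample goals are all assumptions, since the slides do not define how goals or constraints are represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiningGoal:
    """A candidate data mining goal with its (hypothetical) data requirements."""
    name: str
    problem_type: str     # e.g. "classification", "segmentation"
    needs_labels: bool    # does the problem type require a labelled target?
    min_rows: int         # smallest data set considered usable


def feasible_goals(goals, has_labels: bool, n_rows: int):
    """Keep only the goals whose requirements are met by the available data."""
    return [
        g for g in goals
        if (has_labels or not g.needs_labels) and n_rows >= g.min_rows
    ]


candidates = [
    MiningGoal("predict attrition", "classification", needs_labels=True, min_rows=1000),
    MiningGoal("segment customers", "segmentation", needs_labels=False, min_rows=500),
]

# Constraints of the available data: no labelled target, 800 rows.
feasible = feasible_goals(candidates, has_labels=False, n_rows=800)
print([g.name for g in feasible])
```

Keeping requirements on the goal objects rather than in tool-specific code is in the spirit of the tool independence the slide asks for.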
  29. 29. Elements to conceptualize <ul><li>Elements to be taken into account: </li></ul><ul><ul><li>Data: </li></ul></ul><ul><ul><ul><li>Quality from the data mining point of view </li></ul></ul></ul><ul><ul><ul><li>Adequacy for the problem </li></ul></ul></ul><ul><ul><ul><li>Classification for data mining purposes </li></ul></ul></ul><ul><ul><li>Knowledge: </li></ul></ul><ul><ul><ul><li>Related to the process being analyzed </li></ul></ul></ul><ul><ul><ul><li>Related to the data used </li></ul></ul></ul><ul><ul><li>People </li></ul></ul><ul><ul><ul><li>Owners of data </li></ul></ul></ul><ul><ul><ul><li>Experts in the process </li></ul></ul></ul><ul><ul><li>Requirements of data mining problems </li></ul></ul><ul><ul><li>Requirements of data mining methods </li></ul></ul>
  30. 30. Proposed process
  31. 31. DMMO <ul><li>Data Mining Modelling Objects: </li></ul><ul><ul><li>Data </li></ul></ul><ul><ul><li>Knowledge </li></ul></ul><ul><ul><li>Constraints of data and applications </li></ul></ul><ul><ul><li>Data Mining objects </li></ul></ul><ul><ul><ul><li>Algorithms </li></ul></ul></ul><ul><ul><ul><li>Measures </li></ul></ul></ul><ul><ul><ul><li>Methods </li></ul></ul></ul><ul><li>To bridge the gap between data miners and business users </li></ul>
  32. 32. Are data adequate for analysis? <ul><li>The adequacy of the data is analyzed taking into account the goals to fulfil. </li></ul><ul><li>Data, together with the knowledge extracted from the experts, can be transformed so that, when given as input to a certain data mining algorithm, they produce the required patterns. </li></ul><ul><li>Quality of the data, in this context: </li></ul><ul><ul><li>is not only related to technical quality (proper model, percentage of null values), </li></ul></ul><ul><li>but also has to do with: </li></ul><ul><ul><li>the meaning of the attributes, </li></ul></ul><ul><ul><li>where each piece of data comes from, </li></ul></ul><ul><ul><li>the relationships among data, and </li></ul></ul><ul><ul><li>finally, how the data fulfil the requirements of the data mining functions </li></ul></ul>
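Of the quality aspects listed, the percentage of null values is the easiest to measure automatically. A minimal sketch, assuming records are held as dictionaries and that a missing key or a `None` value counts as null; the function name and sample records are invented for the example.

```python
def null_percentage(rows, attribute):
    """Percentage of records in which `attribute` is missing or None."""
    if not rows:
        raise ValueError("rows must not be empty")
    missing = sum(1 for row in rows if row.get(attribute) is None)
    return 100.0 * missing / len(rows)


# Toy customer records, invented for the example.
customers = [
    {"id": 1, "age": 34, "segment": "A"},
    {"id": 2, "age": None, "segment": "B"},
    {"id": 3, "age": 51},                    # "segment" key is absent
    {"id": 4, "age": 28, "segment": "A"},
]

for attr in ("age", "segment"):
    print(f"{attr}: {null_percentage(customers, attr):.1f}% null")
```

The semantic aspects on the slide (meaning of attributes, provenance, relationships) still need input from the data owners and process experts; a check like this covers only the technical side.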
  33. 33. 2. Data Mining: obtain models <ul><li>Apply the data mining process model </li></ul><ul><li>Associated problems solved by the 3-layer architecture: </li></ul><ul><ul><li>Comparison of approaches </li></ul></ul><ul><ul><li>Evaluate costs </li></ul></ul><ul><ul><li>Pros and cons of approaches </li></ul></ul><ul><li>Only experience or a conceptualization can help </li></ul><ul><li>The conceptual model will help to establish the process to obtain each feasible model. </li></ul><ul><li>Requirements and transformations are implicit in the model </li></ul>
  34. 34. 2.1 Determine type of problem <ul><ul><li>What are data mining problems? </li></ul></ul><ul><ul><ul><li>Classification </li></ul></ul></ul><ul><ul><ul><li>Estimation </li></ul></ul></ul><ul><ul><ul><li>Association </li></ul></ul></ul><ul><ul><ul><li>Segmentation </li></ul></ul></ul><ul><ul><li>In the conceptual model, the requirements for each type will be established </li></ul></ul>
  35. 35. 2.2 Apply CRISP-DM process model <ul><ul><li>The Data Mining problem has to be settled before going into the modeling step </li></ul></ul><ul><ul><li>Requirements will be established in Business Understanding </li></ul></ul><ul><ul><li>Requirements will be checked in Data Understanding and Data Preparation </li></ul></ul><ul><ul><li>Preparation will be guided by the conceptual model </li></ul></ul><ul><ul><li>Feasibility can be evaluated before applying the model </li></ul></ul>Business Understanding Data Understanding Data Preparation Modeling Evaluation Deployment Business Understanding
  36. 36. 3. Evaluate results <ul><li>[Spiliopoulou, Berendt] </li></ul><ul><li>Evaluation: the act of ascertaining the value of an object according to specified criteria, operationalised in terms of measures. </li></ul><ul><ul><li>Object = model already obtained </li></ul></ul><ul><ul><li>Criteria and measures have to do with the goals </li></ul></ul><ul><li>Evaluation requires a well-defined notion of success, which must be in place before </li></ul><ul><ul><li>the evaluation takes place </li></ul></ul><ul><ul><li>the data mining phase starts </li></ul></ul><ul><ul><li>any work with the data starts </li></ul></ul><ul><li>i.e. already during the business understanding process. </li></ul><ul><li>Here once again conceptualization plays its role </li></ul>
  37. 37. Evaluation in the CRISP-DM Process <ul><li>The CRISP-DM process is </li></ul><ul><ul><li>a never-ending circle of iterations </li></ul></ul><ul><ul><li>a non-sequential process, where backtracking to previous phases is usually necessary </li></ul></ul><ul><li>In each sequential instantiation evaluation takes place: </li></ul><ul><li>But it is a cycle </li></ul><ul><li>In all the iterations all the steps should be revisited </li></ul><ul><li>Results have to be evaluated! </li></ul>Business Understanding Data Understanding Data Preparation Modeling Evaluation Deployment Business Understanding
  38. 38. 4. Deployment <ul><li>All the models that have a positive evaluation can be deployed </li></ul><ul><li>For the measurements of success to be trusted, deployment has to follow the rules established at the beginning of the project </li></ul><ul><ul><li>The real evaluation has not yet been performed </li></ul></ul>
  39. 39. 5. Evaluate after deployment <ul><li>After deployment, there is a need to prove that the improvements are really due to the actions taken after a data mining discovery and not to any other factor or action carried out in the company </li></ul><ul><li>None of the seemingly obvious claims about the success of data mining has ever been systematically tested. </li></ul><ul><li>Experiments are crucial to establish whether the impact of the deployment is really positive or negative </li></ul><ul><li>Experiments have to be designed at the beginning of the project </li></ul>
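One common way to separate the effect of a deployed model from other factors is a controlled experiment: a treatment group receives the action suggested by the model, a comparable control group does not, and the two outcome rates are compared. The slides do not prescribe a particular test, so the two-proportion z-test below is an assumption, as are the function name and the counts.

```python
from math import erfc, sqrt


def two_proportion_z_test(success_t: int, n_t: int, success_c: int, n_c: int):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value); a small p_value suggests the rates really differ.
    """
    p_t = success_t / n_t
    p_c = success_c / n_c
    pooled = (success_t + success_c) / (n_t + n_c)
    se = sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    if se == 0:
        raise ValueError("standard error is zero; the rates cannot be compared")
    z = (p_t - p_c) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Hypothetical campaign: 1000 customers targeted using the model (treatment),
# 1000 comparable customers handled as before (control).
z, p_value = two_proportion_z_test(success_t=120, n_t=1000, success_c=90, n_c=1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

Group sizes and the success measure have to be fixed before deployment, which is why the slide asks for experiments to be designed at the beginning of the project.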
  40. 40. Conclusions <ul><li>Data mining projects are being developed more as an art than as a science </li></ul><ul><li>Many algorithms have been implemented, but there is no systematic proof, after deployment in real cases, that one is better than another </li></ul><ul><li>A conceptual model is required: </li></ul><ul><ul><li>To map business goals to the model </li></ul></ul><ul><ul><li>To map data mining algorithms to a conceptual model </li></ul></ul><ul><li>Achievements of the model: </li></ul><ul><ul><li>Will be used throughout the process to guide the project </li></ul></ul><ul><ul><li>Evaluation tool </li></ul></ul>
  41. 41. Future work <ul><li>Conceptual model </li></ul><ul><ul><li>Define DMMO objects </li></ul></ul><ul><li>Evaluation techniques related to the model: </li></ul><ul><ul><li>Evaluate data mining goals </li></ul></ul><ul><ul><li>Evaluate business goals </li></ul></ul><ul><li>Experimentation methods: </li></ul><ul><ul><li>obtrusive and </li></ul></ul><ul><ul><li>non-obtrusive </li></ul></ul>
  42. 42. References <ul><li>Bettina Berendt, Myra Spiliopoulou, Ernestina Menasalvas: Evaluation in Web Mining. Tutorial at ECML/PKDD 2004, Pisa, Italy, 20 September 2004 </li></ul><ul><li>Menasalvas, Millán, Gonzalez-Aranda, Segovia: Towards a Methodology for Data Mining Project Development: The Importance of Abstraction </li></ul><ul><li>Bettina Berendt, Andreas Hotho, Dunja Mladenic, Maarten van Someren, Myra Spiliopoulou, Gerd Stumme: Web Mining: From Web to Semantic Web, First European Web Mining Forum, EWMF 2003, Cavtat-Dubrovnik, Croatia, September 22, 2003, Revised Selected and Invited Papers. Springer 2004 </li></ul><ul><li>Myra Spiliopoulou, Carsten Pohle: Modelling and Incorporating Background Knowledge in the Web Mining Process. Pattern Detection and Discovery 2002: 154-169 </li></ul><ul><li>[CATs] clementine/cats.htm </li></ul><ul><li>[SEMMA] www.sas.com/technologies/analytics/datamining/miner/semma.html </li></ul><ul><li>www. crm </li></ul><ul><li>www. e m e e s/whit e pap e r.html </li></ul>
  43. 43. THANKS