Key Principles Of Data Mining


My presentation at the Advanced Analytics Conference, London, 23rd June

Published in: Technology

  • Data mining can be thought of as essentially ‘customer analytics’ or, more precisely, analytics instigated at the request of a customer with the purpose of gaining insight (knowledge) from some data. Typically we view customer analytics as predictive and descriptive modelling, usually in relation to large CRM (Customer Relationship Management) or marketing databases. Data mining exercises often model customers, but any entity for which data is stored can be investigated; other examples include households, web sessions and calls.
  • At this point, we must consider whether the model does indeed reflect the reality of what we are attempting to model and, more importantly, whether it will in fact achieve the business objectives. The model must therefore be thoroughly evaluated, and this includes reviewing the steps taken to construct it. In particular, it is essential to ensure that the model incorporates every important business issue. This may mean that the model needs to be reviewed and reworked, so there is some interaction between phases 4 and 5. This phase typically concludes with a decision on how the data mining results will be used.
  • Data description and summarisation: Initial exploratory data analysis can help to investigate and understand the data, and provide potential hypotheses for hidden information. Summarisation also plays a significant role in the presentation of final results.
  • Segmentation: A segmentation analysis aims to separate the data into interesting and meaningful subgroups or classes, so that members of a subgroup share common characteristics. A classic example is a shopping basket analysis, where the segments of baskets depend on the items they contain.
  • Concept descriptions: Concept description aims to give an understandable description of concepts or classes. It is not done to produce complete models with high prediction accuracy, but to gain insight. For example, a company might be interested in learning more about its loyal and disloyal customers. From concept descriptions such as this, the company could conclude what might be done to keep customers loyal, or to turn disloyal customers into loyal ones. Concept description has close connections with both segmentation and classification. Segmentation could lead to a concept or class of data without any really understandable description of the elements in that class.
  • Classification: Classification has connections to almost all other problem types. For example, credit scoring attempts to assess the credit risk of a new customer. This problem can be transformed into a classification problem by partitioning customers into two classes: good customers and bad customers. The resulting model can then be used to assign prospective customers to one of the two classes, and hence either accept or reject them.
  • Prediction: Prediction problems are similar to classification problems, with one major difference: in prediction, the target attribute (or class) is not a qualitative discrete attribute but a continuous one. The aim of a prediction model is therefore to find and assign a numerical value of the target attribute for unseen objects. In particular, if the prediction model deals with time series data, it is often referred to as forecasting.
  • Dependency analysis: Dependency analysis consists of finding a model that describes significant dependencies (or associations) between data items or events. Dependencies can be used to predict the value of a data item given information on other items, although in general they are mostly used for understanding.
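
To make the dependency analysis idea concrete, here is a minimal sketch of two measures commonly used for association rules on shopping baskets: support and confidence. The measure names, the function names and the toy baskets are our own illustrative assumptions; the notes above do not prescribe a particular method.

```python
from typing import List, Set

def support(baskets: List[Set[str]], items: Set[str]) -> float:
    """Fraction of baskets that contain every item in `items`."""
    if not baskets:
        return 0.0
    return sum(1 for basket in baskets if items <= basket) / len(baskets)

def confidence(baskets: List[Set[str]], antecedent: Set[str], consequent: Set[str]) -> float:
    """Estimated share of baskets containing `antecedent` that also contain `consequent`."""
    base = support(baskets, antecedent)
    if base == 0:
        return 0.0
    return support(baskets, antecedent | consequent) / base

# Toy data (hypothetical): each set is one shopping basket.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]

if __name__ == "__main__":
    print("support(bread, butter):", support(baskets, {"bread", "butter"}))
    print("confidence(bread -> butter):", confidence(baskets, {"bread"}, {"butter"}))
```

A rule such as "bread implies butter" with high confidence describes a dependency that helps understanding, and it could also be used to predict one item from another.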
  • Key Principles Of Data Mining

    1. Key Principles of Data Mining
       Presentation by Tobie Muir (Data-Decisions)
       Henry Stewart Briefing: An Introduction to Marketing Analytics
       London, 23rd June 2010
    2. What is data mining?
       “Data mining is the process of finding patterns in your data which you can use to do your business better”
       Alan Montgomery, formerly Managing Director, Integral Solutions Limited (now part of IBM/SPSS)
       • Nowadays every credit card used, every transaction processed, every loan application, etc. is recorded digitally, creating massive databases of raw information.
       • These datasets can be incomprehensibly large: too large to analyse without the aid of computer-driven processes.
       • The role of data mining is to introduce (semi-)automated computer-driven processes and statistical techniques to extract meaningful patterns from such data, with the goal of improving the business in question. A classic example in marketing is using data mining insights to achieve revenue with a smaller marketing budget.
       • For very large datasets, data mining can focus on a sample. Instead of analysing millions (or billions) of records, which can be computationally expensive and slow, we analyse a subset of the data in the hope that patterns prevalent in the subset also apply to the entire dataset.
       • Careful analysis is then required to determine whether any patterns found are meaningful: they could be spurious or coincidental, or a pattern may be found only in the subset.
       Copyright © 2010 Data-Decisions Ltd
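
As a rough illustration of the sampling point on this slide, the sketch below draws a random subset and compares a simple pattern (a response rate) in the subset against the full dataset. The records are synthetic and the function name is hypothetical. In practice the full dataset is often too expensive to scan, so a second independent sample would usually play the role of the comparison set.

```python
import random

def sample_vs_full_rate(records, predicate, sample_size, seed=0):
    """Return (rate in a random sample, rate in the full dataset) for `predicate`."""
    rng = random.Random(seed)                  # fixed seed so the sketch is repeatable
    sample = rng.sample(records, sample_size)  # sampling without replacement
    sample_rate = sum(1 for r in sample if predicate(r)) / len(sample)
    full_rate = sum(1 for r in records if predicate(r)) / len(records)
    return sample_rate, full_rate

# Synthetic data (an assumption for illustration): 1 in 10 customers responded.
records = [{"customer_id": i, "responded": i % 10 == 0} for i in range(100_000)]

if __name__ == "__main__":
    sample_rate, full_rate = sample_vs_full_rate(
        records, lambda r: r["responded"], sample_size=5_000
    )
    print(f"sample rate: {sample_rate:.3f}, full rate: {full_rate:.3f}")
```

If the two rates differ by much more than sampling noise would explain, the pattern seen in the subset should be treated with suspicion.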
    3. Where does data mining fit with BI tools?
       • Data mining is generally thought of as a subset of Business Intelligence (BI).
       • Business intelligence tools can also encompass the extraction, storage, visualisation and distribution of business information, not just the analysis of business data.
       • Leading BI tools will typically contain data mining capabilities as well as other, more general capabilities, including decision support systems, query and reporting, online analytical processing (OLAP), statistical analysis and forecasting.
       Business Intelligence
       • Decision Support Systems
       • Online analytical processing (OLAP)
       • Statistical analysis and forecasting
       • Query and Reporting
       Data Mining
    4. Business Intelligence
       Data Mining
    5. The Relationship between Data Mining and Advanced Analytics
       Advanced Analytics
       Data Mining
       Focus on Customers
       Everything else...
       Customer Acquisition
       • Segment and Target
       • Optimise Best Media Mix
       • Optimise Responses
       Customer Retention
       • Churn Prediction
       Customer Expansion
       • Campaign Optimisation
       • Cross-Sell
       • Up-Sell
    6. The CRISP data mining process
       CRISP-DM stands for Cross-Industry Standard Process for Data Mining.
       It was developed by the CRISP-DM consortium, consisting of DaimlerChrysler (formerly Daimler-Benz), SPSS (formerly ISL) and NCR.
       The idea was to standardise the process of data mining across the industry: a common pattern for the process was established among all collaborators, and CRISP-DM was also a mechanism to introduce uniform terminology and differentiation.
       CRISP-DM 1.0 was rolled out in August 2000, including detailed documentation.
       To the right is the standard six-part CRISP model of how the data mining process occurs, taken from this document.
       The model highlights the relationships and interdependencies between all six phases: the data mining process is a dynamic one.
    7. The CRISP data mining process: Phases 1 and 2
       1. Business understanding
       We begin by understanding the requirements of the project from the business perspective: what does the company in question want to achieve or get out of this? What are the priorities? How will we measure the outcome? We conclude this phase by producing a preliminary (phase) plan to tackle the established objectives.
       2. Data understanding
       The data understanding phase has two broad aims. The first is to test the data on which the analysis will be based in order to identify any quality issues. The second is to try to discover any initial insights into the data that might provide additional meaningful information.
       Some basic data visualisation (scatter plots, bar charts, distribution analysis) is a great way to get to grips with the data, spot any immediate patterns and test the general sufficiency of the data, which leads logically on to the next phase, Data Preparation.
    8. The CRISP data mining process: Phases 3 and 4
       3. Data preparation
       The data preparation phase does exactly what its name suggests: this is the phase in which the initial (raw) data is modified to produce the final dataset on which the analysis will take place.
       Data preparation covers all activities that turn the raw data into the final dataset, ready for the modelling phase. These include merging separate datasets and further data pooling, table/record/attribute selection, missing value imputation, data cleaning, spurious data removal and transformation. It is also advisable to consider how to partition the data into modelling and testing segments (typically a 70/30 split, depending on data volumes).
       Data preparation, in my experience, is the most time-consuming, but absolutely ESSENTIAL, phase of the entire CRISP process.
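
Two of the preparation steps above, missing value imputation and the 70/30 partition, can be sketched in a few lines. This is only one possible approach under stated assumptions: mean imputation is our choice of technique, the function and field names are hypothetical, and the fill value is computed from the modelling segment only so that the testing segment does not influence it.

```python
import random
from statistics import mean

def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of `rows` and split it into (modelling, testing) segments."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def mean_of_observed(rows, field):
    """Mean of `field` over rows where the value is present (not None)."""
    return mean(r[field] for r in rows if r[field] is not None)

def fill_missing(rows, field, value):
    """Return copies of `rows` with None in `field` replaced by `value`."""
    return [{**r, field: value if r[field] is None else r[field]} for r in rows]

if __name__ == "__main__":
    # Hypothetical customer records; two ages are missing.
    customers = [{"id": i, "age": age} for i, age in enumerate(
        [34, None, 51, 29, 42, None, 60, 38, 45, 27])]
    modelling, testing = split_rows(customers)       # 7 and 3 rows
    fill = mean_of_observed(modelling, "age")        # learned from modelling data only
    modelling = fill_missing(modelling, "age", fill)
    testing = fill_missing(testing, "age", fill)
    print(len(modelling), len(testing), round(fill, 1))
```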
    9. The CRISP data mining process: Phases 3 and 4
       4. Modelling
       The modelling phase is the heart of the CRISP model. This is the point at which we take the modified dataset and apply (typically) several modelling techniques.
       We would want to use several techniques because no single technique is perfect, and the range of results gathered should help to overcome the limitations of any one particular model. There is some interaction between phases 3 and 4: different techniques may require the data in different forms, so it may be necessary to prepare the data in multiple ways for the various models.
       We will cover some of the different modelling techniques later in the presentation.
    10. The CRISP data mining process: Phase 5
        5. Evaluation
        There are many different techniques and methods for evaluating the models created during the modelling phase. First and foremost, you are looking to compare the model error rates or, inversely, the model accuracy rates. These are estimated by how well the models perform on the test data (data that was omitted during the model building phase).
        There are a number of ways to measure this, but most methods simply amount to providing a score that allows you to choose the model with the lowest error rate.
        Lift charts provide a very effective way to visualise and compare model performance over the test set. They are also a good way to assess whether you may need to combine models to arrive at a better overall solution.
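
The comparison described on this slide can be sketched as follows: score each candidate model on the held-out test rows and keep the one with the lowest error rate. The two candidate models and the test rows are made up for illustration, and misclassification rate is just one of the possible scores.

```python
def error_rate(predict, test_rows):
    """Share of held-out rows where the predicted label differs from the actual label."""
    wrong = sum(1 for features, actual in test_rows if predict(features) != actual)
    return wrong / len(test_rows)

def pick_best(models, test_rows):
    """Return (name, error rate) of the candidate with the lowest test error."""
    scores = {name: error_rate(predict, test_rows) for name, predict in models.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]

# Hypothetical test data: (features, actual response) pairs not used to build the models.
test_rows = [
    ({"spend": 120}, 1),
    ({"spend": 20}, 0),
    ({"spend": 80}, 1),
    ({"spend": 40}, 0),
]

# Hypothetical candidates: a simple spend threshold and a "nobody responds" baseline.
models = {
    "threshold_50": lambda f: 1 if f["spend"] > 50 else 0,
    "always_zero": lambda f: 0,
}

if __name__ == "__main__":
    print(pick_best(models, test_rows))
```

Including a trivial baseline such as `always_zero` is a useful sanity check: a model that cannot beat it on the test data is not adding value.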
    11. Model Evaluation
        • Lift Charts
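
The numbers behind a lift chart can be computed with a short function. The sketch below assumes the usual definition of cumulative lift: the response rate among the top-scored fraction of customers divided by the overall response rate. The slides show the chart but do not spell out a formula, so treat the definition, the function name and the toy scores as assumptions.

```python
def cumulative_lift(scored, fraction):
    """Lift for the top `fraction` of customers when ranked by model score.

    `scored` is a list of (score, responded) pairs, with responded being 0 or 1.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    top_rate = sum(responded for _, responded in ranked[:n_top]) / n_top
    overall_rate = sum(responded for _, responded in ranked) / len(ranked)
    return top_rate / overall_rate if overall_rate else 0.0

# Hypothetical test-set scores: three responders, mostly ranked near the top.
scored = [
    (0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
    (0.4, 0), (0.3, 0), (0.2, 0), (0.1, 0), (0.0, 0),
]

if __name__ == "__main__":
    for fraction in (0.2, 0.5, 1.0):
        print(f"top {fraction:.0%}: lift {cumulative_lift(scored, fraction):.2f}")
```

Plotting lift against the fraction contacted for each candidate model gives the chart; a lift of 1 at any depth means the model does no better than contacting customers at random.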
    12. The CRISP data mining process: Phase 6
        6. Deployment
        The deployment phase consolidates the results that the model produces in a form that is usable by the customer. It could be that the data mining exercise was undertaken simply to increase knowledge of the data, but even with this restricted remit, and more generally, any knowledge gained from the exercise must be presented in a way that is of use to the customer.
        Depending on the nature of the data mining project, the deployment phase can vary from simply generating a report all the way through to implementing a repeatable data mining process across the enterprise. It is not unusual for the customer, rather than the data analyst, to perform the deployment phase. In either case it is important that the customer understands the actions that need to be carried out in order to make best use of the models created.
    13. The most important model building techniques
        A broad list of techniques:
        Supervised Learning
        • Decision trees
        • Artificial neural nets
        • K-nearest neighbour
        • Support vectors
        • Linear regression
        • Logistic regression
        • Discriminant analysis
        • Genetic algorithms
        Unsupervised Learning
        • Clustering techniques
        • Artificial neural nets
        • Conceptual clustering
        • Genetic algorithms
        Decision-trees
        Bayes
        Clustering
    14. How data mining models are built and applied
    15. How models should be evaluated and monitored
        • Keeping data up to date is essential. Data becomes obsolete quickly (even within a year), so re-evaluating models frequently with up-to-date data will help keep them accurate. This includes updating the data with the latest campaign data, response data, etc., and reassessing the model's error rates on the new data.
        • Models need to be evaluated to check that the results produced are compatible with the project objectives.
        • No model is ever perfect, so a model should always be treated as work in progress, subject to continuous, scheduled refinement and improvement.
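
One simple way to act on the monitoring advice above is to re-score the model on each new batch of campaign data and flag the periods where the error rate has drifted too far from the rate measured at build time. The tolerance value, the period labels and the function name below are illustrative assumptions, not figures from the presentation.

```python
def periods_needing_review(baseline_error, errors_by_period, tolerance=0.03):
    """Return the periods whose error rate exceeds the baseline by more than `tolerance`."""
    return [
        period
        for period, error in errors_by_period.items()
        if error - baseline_error > tolerance
    ]

if __name__ == "__main__":
    # Hypothetical error rates measured on fresh response data each quarter.
    recent = {"2010-Q1": 0.10, "2010-Q2": 0.11, "2010-Q3": 0.16}
    print(periods_needing_review(baseline_error=0.10, errors_by_period=recent))
```

A flagged period is a prompt to refresh the data and rebuild or recalibrate the model, rather than proof that the model has failed.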
    16. Conclusion
        • “Data mining is the process of finding patterns in your data which you can use to do your business better”
        • Data mining is a subset of a much larger sphere known as Business Intelligence, which includes data parsing, visualisation, OLAP and data warehousing.
        • Advanced analytics encompasses data mining but also includes non-customer-focussed activities that require mathematical and statistical approaches.
        • CRISP is an established, proven data mining framework.
        • The key emphasis in data mining must be on understanding. Never underestimate the importance of, or the amount of work involved in, data mining.
        • No model is ever perfect; it is only the starting point for future iterative improvements.
    17. References
        Web:
        Books (& recommended reading):
        • Data Mining (Witten & Frank)
        • Applied Data Mining: Statistical Methods for Business and Industry (Paolo Giudici)
        • Data Mining Techniques: for Marketing, Sales and Customer Relationship Management (Berry and Linoff)
        Tobie Muir (Managing Director)
        T. 0208 144 7422 / 07903 525358