Data Mining Lecture 1: Introduction to Data Mining



  1. 1. Data Mining <ul><ul><li>Lecture 1: </li></ul></ul><ul><ul><li>Introduction to Data Mining </li></ul></ul><ul><ul><li>Manuel Penaloza, PhD </li></ul></ul>
  2. 2. Introduction to Data Mining <ul><li>Society produces huge amounts of data daily </li></ul><ul><ul><li>Retail Store </li></ul></ul><ul><ul><ul><li>POS data on customer purchases </li></ul></ul></ul><ul><ul><li>Banks </li></ul></ul><ul><ul><ul><li>Collection of customer service calls </li></ul></ul></ul><ul><ul><li>Telecommunications </li></ul></ul><ul><ul><ul><li>Phone call records (mobile and house-based calls) </li></ul></ul></ul><ul><ul><li>Medicine </li></ul></ul><ul><ul><ul><li>Genomic data collected on the structure of genes </li></ul></ul></ul><ul><ul><li>Government </li></ul></ul><ul><ul><ul><li>Law enforcement data, income tax data </li></ul></ul></ul><ul><ul><li>Others: (Transactional) data from Sports, Schools, Research, Search engines, etc. </li></ul></ul>
  3. 3. What is Data Mining (DM)? <ul><li>It is the process of discovering hidden relationships and patterns in large data sets </li></ul><ul><ul><li>It can also predict the outcome of a future observation </li></ul></ul><ul><li>Data mining is an interdisciplinary field </li></ul><ul><ul><li>It is an extension to statistical analysis </li></ul></ul><ul><ul><li>It uses techniques from: </li></ul></ul><ul><ul><ul><li>Statistics </li></ul></ul></ul><ul><ul><ul><li>Machine learning </li></ul></ul></ul><ul><ul><ul><li>Pattern recognition </li></ul></ul></ul><ul><ul><ul><li>Database technology </li></ul></ul></ul><ul><ul><ul><li>Visualization </li></ul></ul></ul><ul><ul><ul><li>High-performance computing </li></ul></ul></ul>
  4. 4. Questions answered by DM <ul><li>Extracting useful information from a dataset to answer questions such as: </li></ul><ul><ul><li>Which credit card customers are most profitable? </li></ul></ul><ul><ul><li>Which loan applicants are high-risk? </li></ul></ul><ul><ul><li>Which customers will respond to a planned promotion? </li></ul></ul><ul><ul><li>How do we detect phone card fraud? </li></ul></ul><ul><ul><li>How do customer profiles change over time? </li></ul></ul><ul><ul><li>Which customers prefer product A over product B? </li></ul></ul><ul><ul><li>What is the revenue prediction for next year? </li></ul></ul><ul><ul><li>Which students are more likely to transfer than others? </li></ul></ul><ul><ul><li>Which taxpayers may be cheating the system? </li></ul></ul><ul><ul><li>Who is most likely to violate a probation sentence? </li></ul></ul><ul><ul><li>What is the predicted outcome for some treatment? </li></ul></ul>
  5. 5. Data sources <ul><li>Relational Databases </li></ul><ul><ul><li>Transactional data with many tables </li></ul></ul><ul><li>Data warehouses </li></ul><ul><ul><li>Historical data, aggregated and updated periodically </li></ul></ul><ul><li>Files </li></ul><ul><ul><li>In special format (e.g., CSV) or proprietary binary </li></ul></ul><ul><li>Internet or electronic mail </li></ul><ul><ul><li>HTML, XML, web search results, e-mails </li></ul></ul><ul><li>Scientific, research </li></ul><ul><ul><li>Seismology, remote sensing, etc. </li></ul></ul>
  6. 6. Example: Health System <ul><li>Characteristics of the Health System: </li></ul><ul><ul><li>Personal medical records (GP, specialists, etc.) </li></ul></ul><ul><ul><li>Billing records </li></ul></ul><ul><ul><li>Hospital data (surgery, admission, etc.) </li></ul></ul><ul><li>Questions: </li></ul><ul><ul><li>Are MDs following the procedures? </li></ul></ul><ul><ul><li>Which patients may have adverse drug reactions? </li></ul></ul><ul><ul><li>Are people committing fraud? </li></ul></ul><ul><ul><li>Which patients are most likely to get cancer? </li></ul></ul>
  7. 7. Case study: E-commerce <ul><li>A person buys a book from </li></ul><ul><li>Objective: Recommend other books this person is likely to buy </li></ul><ul><li>Amazon may do clustering or sequential pattern analysis based on books bought by other people </li></ul><ul><li>Data analyzed: </li></ul><ul><ul><li>Customers who bought “Data Mining: Practical Machine Learning Tools and Techniques” also bought “Introduction to Data Mining” </li></ul></ul><ul><li>Recommendations have been successful for Amazon </li></ul><ul><ul><li>Increasing buyers’ satisfaction and purchases </li></ul></ul>
  8. 8. What motivated data mining? <ul><li>Growth in data collection </li></ul><ul><li>Presence of data warehouses with reliable data </li></ul><ul><li>Competitive pressure to increase sales </li></ul><ul><li>The development of commercial off-the-shelf (COTS) data mining software </li></ul><ul><ul><li>Examples: XLMiner, Insightful Miner, SAS, SPSS </li></ul></ul><ul><li>Growth of computing power and storage capacity </li></ul><ul><li>High dimensionality of the data </li></ul><ul><li>Heterogeneous and complex data </li></ul><ul><li>Limitations of humans </li></ul>
  9. 9. Insightful Miner™ 7: GUI * Figures taken from the Insightful Miner 7 Guide
  10. 10. Creating Models <ul><li>Create a network of pipelined components </li></ul><ul><ul><li>By dragging and dropping components </li></ul></ul>
  11. 11. Choosing a data mining system <ul><li>Data mining systems differ in functionality and methodology </li></ul><ul><li>Selection determined by: </li></ul><ul><ul><li>Type of operating system used in your organization </li></ul></ul><ul><ul><li>The data sources handled by the tool: </li></ul></ul><ul><ul><ul><li>ASCII text files, relational databases, XML data </li></ul></ul></ul><ul><ul><li>The data mining functions and methods offered </li></ul></ul><ul><ul><li>Scalability of the system </li></ul></ul><ul><ul><ul><li>Row and column scalability </li></ul></ul></ul><ul><ul><li>Visualization tools available </li></ul></ul><ul><ul><li>Graphical user interface that guides the execution of the methods </li></ul></ul><ul><ul><li>Integration with other information systems </li></ul></ul><ul><ul><li>Cost and performance </li></ul></ul>
  12. 12. Data Mining in Databases <ul><li>Current applications include data mining modules </li></ul><ul><li>Examples: </li></ul><ul><ul><li>Database management systems such as Oracle and MS SQL Server </li></ul></ul><ul><ul><li>CRM (Customer Relationship Management) </li></ul></ul><ul><li>Advantages for Database systems: </li></ul><ul><ul><li>One-stop shopping </li></ul></ul><ul><ul><li>Minimize data movement and conversion </li></ul></ul><ul><li>Disadvantages for Database systems: </li></ul><ul><ul><li>Limited to DM methods available in the system </li></ul></ul><ul><ul><li>Data extractions and transformations may not be powerful enough </li></ul></ul>
  13. 13. Standard data mining life cycle <ul><li>CRISP-DM (Cross-Industry Standard Process for Data Mining) </li></ul><ul><li>It is an iterative process with phase dependencies </li></ul><ul><li>It consists of six (6) phases: </li></ul>see for more information
  14. 14. CRISP-DM <ul><li>Cross-industry standard developed in 1996 </li></ul><ul><ul><li>Analysts from SPSS/ISL, NCR, Daimler-Benz, OHRA </li></ul></ul><ul><li>Funding from European Commission </li></ul><ul><li>Important Characteristics: </li></ul><ul><ul><li>Non-proprietary </li></ul></ul><ul><ul><li>Application/Industry neutral </li></ul></ul><ul><ul><li>Tool neutral </li></ul></ul><ul><ul><li>General problem-solving process </li></ul></ul><ul><ul><li>Process with six phases but missing: </li></ul></ul><ul><ul><ul><li>Saving results and updating the model </li></ul></ul></ul>
  15. 15. CRISP-DM Phases (1) <ul><li>Business Understanding </li></ul><ul><ul><li>Understand project objectives and requirements </li></ul></ul><ul><ul><li>Formulation of a data mining problem definition </li></ul></ul><ul><li>Data Understanding </li></ul><ul><ul><li>Data collection </li></ul></ul><ul><ul><li>Evaluate the quality of the data </li></ul></ul><ul><ul><li>Perform exploratory data analysis </li></ul></ul><ul><li>Data Preparation </li></ul><ul><ul><li>Clean, prepare, integrate, and transform the data </li></ul></ul><ul><ul><li>Select appropriate attributes and variables </li></ul></ul>
  16. 16. CRISP-DM Phases (2) <ul><li>Modeling </li></ul><ul><ul><li>Select and apply appropriate modeling techniques </li></ul></ul><ul><ul><li>Calibrate model parameters to optimize results </li></ul></ul><ul><ul><li>If necessary, return to data preparation phase to satisfy model's data format </li></ul></ul><ul><li>Evaluation </li></ul><ul><ul><li>Determine if model satisfies objectives set in phase 1 </li></ul></ul><ul><ul><li>Identify business issues that have not been addressed </li></ul></ul><ul><li>Deployment </li></ul><ul><ul><li>Organize and present the model to the “user” </li></ul></ul><ul><ul><li>Put model into practice </li></ul></ul><ul><ul><li>Set up for continuous mining of the data </li></ul></ul>
  17. 17. Data mining tasks (1) <ul><li>Classification </li></ul><ul><ul><li>Predict the categorical value of a target (dependent) variable based on the values of other attributes </li></ul></ul><ul><ul><li>Target variable is partitioned into classes </li></ul></ul><ul><ul><li>It predicts class membership of a new observation </li></ul></ul><ul><ul><li>Example: Which drug should be prescribed for older patients with low sodium/potassium ratios? </li></ul></ul><ul><li>Estimation </li></ul><ul><ul><li>Similar to classification except target variable is numeric </li></ul></ul><ul><ul><li>That is, predicting a numeric value </li></ul></ul><ul><ul><li>Example: Estimate the blood pressure of a person based on his/her age, gender, body mass index, etc. </li></ul></ul>
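
The estimation task above can be illustrated with a small least-squares fit. This is only a sketch: the function name `fit_line` and the age/blood-pressure values are made up for illustration and do not come from the lecture, and a single predictor (age) stands in for the several attributes the slide mentions.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical training data: age (years) and systolic blood pressure (mmHg).
ages = [30, 40, 50, 60]
pressures = [115, 120, 125, 130]

slope, intercept = fit_line(ages, pressures)

# Estimate the numeric target for a new observation (a 45-year-old).
estimate = slope * 45 + intercept
print(f"estimated blood pressure: {estimate:.1f}")
```

The target here is numeric, which is what separates estimation from classification; the same fitted line could be reused for any new age value.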
  18. 18. Data mining tasks (2) <ul><li>Prediction </li></ul><ul><ul><li>Similar to estimation except that results lie in the future </li></ul></ul><ul><ul><li>Example: Predict the price of a stock 3 months into the future </li></ul></ul><ul><li>Clustering </li></ul><ul><ul><li>Grouping similar records together </li></ul></ul><ul><ul><li>Example: Find patients with similar profiles </li></ul></ul><ul><li>Associations </li></ul><ul><ul><li>Uncover rules that indicate the association between two or more attributes </li></ul></ul><ul><ul><li>Example: Find out which items are purchased together </li></ul></ul>
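
For the association task, two commonly used measures of a rule "A implies B" are its support and its confidence. The sketch below computes both over a handful of invented market baskets; the helper names `support` and `confidence` and the basket contents are assumptions for illustration, not material from the lecture.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Share of transactions containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets, one set of items per purchase.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Rule under test: customers who buy bread also buy milk.
rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule is usually kept only if both measures exceed thresholds chosen by the analyst, which is one place where the human judgment discussed later in the lecture enters.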
  19. 19. Task: Classification <ul><li>Build a model that learns to predict the class from pre-labeled instances or observations </li></ul><ul><ul><li>Many approaches: Regression, Decision Trees, Neural Networks </li></ul></ul>Given a set of points from known classes, what is the class of a new point? * Diagram taken from
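
The question on this slide (given labeled points, what is the class of a new point?) can be sketched with a k-nearest-neighbor classifier, one of the methods listed later in the lecture. It is used here only because it is short; the slide itself names regression, decision trees, and neural networks. The function `knn_predict`, the points, and the labels are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train, new_point, k=3):
    """Predict the class of new_point by majority vote of its k nearest neighbors.

    `train` is a list of (point, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda pair: dist(pair[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical pre-labeled observations from two classes.
labeled = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.0), "B"),
]

predicted = knn_predict(labeled, (1.8, 1.4))
print(predicted)
```

Note that `math.dist` requires Python 3.8 or later. The choice of `k` is a model parameter that would be calibrated during the modeling phase.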
  20. 20. Task: Clustering <ul><li>Find groupings of instances given unlabeled data </li></ul>* Diagram taken from
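
One common way to group unlabeled instances is k-means. The lecture does not name a specific clustering algorithm, so the sketch below is an assumption chosen for brevity; the function `k_means`, the points, and the starting centroids are all invented for illustration.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled data and deliberately rough starting centroids.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
centroids, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

Unlike the classification sketch, no labels are supplied; the groups emerge from distances alone, and the number of clusters is chosen by the analyst.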
  21. 21. DM looks easy [Diagram: Data → Data Mining Method (Regression, Decision Tree, Neural Network, …, Association Rules) → Model] <ul><li>But it is not easy </li></ul><ul><li>The real world is complicated </li></ul>
  22. 22. Methods and Techniques <ul><li>Cluster analysis (tasks: clustering) </li></ul><ul><li>Association rules (tasks: association) </li></ul><ul><li>Decision trees (tasks: prediction, classification) </li></ul><ul><li>Neural networks (tasks: prediction, classification) </li></ul><ul><li>K-nearest neighbor (tasks: prediction, classification, clustering) </li></ul><ul><li>Regression analysis (tasks: estimation, prediction) </li></ul><ul><li>Confidence interval estimation (tasks: estimation) </li></ul>
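
Of the methods listed, confidence interval estimation is the simplest to show in a few lines. The sketch below uses the normal approximation for the mean; the function name `mean_confidence_interval` and the sample values are assumptions for illustration, and for a sample this small a t-based interval would normally be preferred.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% level
    margin = z * stdev(sample) / sqrt(len(sample))
    center = mean(sample)
    return center - margin, center + margin

# Hypothetical sample of blood pressure readings (mmHg).
readings = [120, 125, 130, 118, 122, 127, 124, 126]
low, high = mean_confidence_interval(readings)
print(f"95% interval: {low:.1f} to {high:.1f}")
```

`statistics.NormalDist` requires Python 3.8 or later.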
  23. 23. Fallacies of Data Mining (1) <ul><li>Fallacy 1: There are data mining tools that automatically find the answers to our problems </li></ul><ul><ul><li>Reality: There are no automatic tools that will solve your problems “while you wait” </li></ul></ul><ul><li>Fallacy 2: The DM process requires little human intervention </li></ul><ul><ul><li>Reality: The DM process requires human intervention in all its phases, including updating and evaluating the model by human experts </li></ul></ul><ul><li>Fallacy 3: Data mining has a quick ROI </li></ul><ul><ul><li>Reality: It depends on the startup costs, personnel costs, data source costs, and so on </li></ul></ul>
  24. 24. Fallacies of Data Mining (2) <ul><li>Fallacy 4: DM tools are easy to use </li></ul><ul><ul><li>Reality: Analysts must be familiar with the model </li></ul></ul><ul><li>Fallacy 5: DM will identify the causes of the business problem </li></ul><ul><ul><li>Reality: DM tools only identify patterns in your data; analysts must identify the causes </li></ul></ul><ul><li>Fallacy 6: Data mining will clean up a data repository automatically </li></ul><ul><ul><li>Reality: The sequence of transformation tasks must be defined by an analyst during early DM phases </li></ul></ul><ul><ul><li>* Fallacies described by Jen Que Louie, President of Nautilus Systems, Inc. </li></ul></ul>
  25. 25. In summary, <ul><li>Problems suitable for Data Mining: </li></ul><ul><ul><li>Require discovering knowledge to make the right decisions </li></ul></ul><ul><ul><li>Have no adequate current solution </li></ul></ul><ul><ul><li>Have an expected high payoff for the right decisions </li></ul></ul><ul><ul><li>Have accessible, sufficient, and relevant data </li></ul></ul><ul><ul><li>Have a changing environment </li></ul></ul><ul><li>IMPORTANT: </li></ul><ul><ul><li>ENSURE privacy if personal data is used! </li></ul></ul><ul><ul><li>Not every data mining application is successful! </li></ul></ul>
  26. 26. Main References <ul><li>Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition, Morgan Kaufmann Publishers </li></ul><ul><li>Daniel Larose. Discovering Knowledge in Data: An Introduction to Data Mining, Wiley Publication </li></ul><ul><li>Pang-Ning Tan et al. Introduction to Data Mining, Addison Wesley </li></ul><ul><li>Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers </li></ul><ul><li>Online data mining course offered by KDnuggets™ at </li></ul><ul><li>Engineering Statistics Handbook available online at </li></ul>
  27. 27. Exercise #1 <ul><li>CRISP-DM is not the only DM process. Do a quick search on the Internet for another process and describe its similarities to and differences from CRISP-DM. </li></ul><ul><li>Determine how data mining could help a web search engine company like Google in its operation. </li></ul><ul><ul><li>Identify one or more objectives. </li></ul></ul><ul><ul><li>Which data mining task(s) could help this company? </li></ul></ul>