
# Multivariate Statistical Analysis for Data Mining



1. **Introduction to Data Mining** (Kwok-Leung Tsui, Industrial & Systems Engineering, Georgia Institute of Technology, 1/5/2009)
2. **What is Data Mining? Data Flood!** Data mining is the extraction of meaningful, useful, or interesting patterns from a large volume of data sources (signal, image, time series, transaction, text, web, etc.). Data mining is one of the top ten emerging technologies (MIT's Technology Review, 2004).
3. **DM Fields & Backgrounds** Data mining is an emerging multi-disciplinary field: statistics (especially multivariate statistics), machine learning, application background (e.g., biology), pattern recognition, databases, visualization, OLAP and data warehousing, etc.
4. **Commonly Used Language in Data Mining** Data Mining = DM; Knowledge Discovery in Databases = KDD; Massive Data Sets = MD (Very Large Data Base = VLDB); Data Analysis = DA.
5. **Data Mining** DM ≠ MD, DM ≠ DA; is DA + MD = DM? Statistical DM means computationally feasible algorithms with little or no human intervention. Money issue: DA software costs roughly \$5-10K, DM software roughly \$100K.
6. **Statistical Data Mining** Data mining is exploratory data analysis with little or no human interaction using computationally feasible techniques, i.e., the attempt to find unknown interesting structure. (Source: Ed Wegman)
7. **Data Mining** Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules. (Source: Mastering Data Mining, Berry and Linoff, 2000)
8. **KDD** Knowledge discovery in databases (KDD) is a multi-disciplinary research field for the non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from data. (Source: Data Mining, Adriaans and Zantinge, 1996)
9. **KDD & Data Mining** The KDD process (DM process) is the process of using data mining methods (algorithms) to extract knowledge according to specified measures and thresholds, using a database along with any necessary preprocessing or transformations. Data mining (and modeling) is a step in the knowledge discovery process consisting of particular algorithms (methods) that, under some acceptable objective, produce particular patterns or knowledge from the data. Variants include text mining, web mining, etc.; some people treat DM and KDD as equivalent.
10. **Data Mining, Statistics, CS** Data miners extract useful information from large amounts of raw data; statisticians support data mining with mathematical theory and statistical methods; computer scientists support data mining with computational algorithms and relevant software. (Friedman)
11. **Applications** Bioinformatics; sales and marketing; health care / medical diagnosis; supply chain management; process control; network intrusion detection; astronomy; sports and entertainment.
12. **Examples of DM Applications** Finance: forecast stock prices or movements with neural networks or time series. Telecom: predict churn rate and customer usage with trees, logistic regression, and activity monitoring. Retail: identify cross-selling with association rules (e.g., Market Basket Analysis, RFM Analysis). Pharmaceutical: segment customers into behavior groups with clustering and classification. Banking: customer relationship management (CRM) with clustering and association.
13. **Examples of DM Applications, cont.** Hotel/airline: identify potential customers for promotion offers with trees or neural networks. Ocean terminal operation data mining for efficiency improvement; sales/demand data mining for inventory planning; UPS transaction data mining for mailbox location design.
14. **Prerequisites for Data Mining** A large amount of data, internal and external (called a "data warehouse," "data mart," etc.): phone calls, web visits, supermarket transactions, weather data, etc.; mega-, giga-, and tera-bytes. Enabled by information technology advancement; most companies have these resources. (Friedman)
15. **Prerequisites for Data Mining, cont.** Advanced computer technology (fast CPUs, parallel architectures, etc.) allows fast access to vast amounts of data and supports computationally intensive algorithms and statistical methods. Knowledge of the business or subject matter is needed to ask important business questions and to understand and verify the discovered knowledge.
16. **Data Mining Process** (section divider)
17. **Data Mining (KDD) Process** Determine business objectives, then data preparation, then mining & modeling, then consolidation and application.
18. **Formulate Business Objectives** Examples for a telecom company: identify important customer traits to keep profitable customers and to predict fraudulent behavior, credit risks, and customer churn; improve programs in target marketing, marketing channel management, micro-marketing, and cross-selling; meet effectively the challenges of new product development.
19. **Data Preparation** From source systems (legacy and external), identify the data needed and its sources, extract the data from the source systems, then cleanse and aggregate it into a model discovery file and a model evaluation file.
20. **Mining & Modeling** Explore the data, construct a model, transform it into a usable format, and evaluate it; the model discovery and model evaluation files yield ideas, reports, and models.
21. **Consolidation and Application** Communicate and extract/transport the discovered knowledge (ideas, reports, databases, models), make business decisions, and improve the model.
22. **Effort in Data Mining** Bar chart (effort in %, axis 0-60) comparing the four phases: business objective determination, data preparation, data mining, and consolidating discovered knowledge.
23. **Data Preparation** (section divider)
24. **Databases & Data Warehouses** Relational, object-oriented, transactional, time-series, and spatial databases; data warehouses and data cubes. SQL = Structured Query Language; OLAP = On-Line Analytical Processing; MOLAP = Multidimensional OLAP, whose fundamental data object is the data cube; ROLAP = Relational OLAP using extended SQL.
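The roll-up (aggregation) behind the data cube on slide 24 can be sketched with plain Python dictionaries. This is a minimal illustration, not OLAP tooling; the sales records and dimension names (region, product) are invented for the example.

```python
from collections import defaultdict

# Hypothetical transaction records (not from the slides).
sales = [
    {"region": "East", "product": "A", "amount": 100},
    {"region": "East", "product": "B", "amount": 50},
    {"region": "West", "product": "A", "amount": 70},
]

def roll_up(records, dims):
    """Aggregate the 'amount' measure over the given dimensions:
    one cube cell per distinct combination of dimension values."""
    cube = defaultdict(int)
    for r in records:
        cube[tuple(r[d] for d in dims)] += r["amount"]
    return dict(cube)

by_region = roll_up(sales, ["region"])                      # coarser cells
by_region_product = roll_up(sales, ["region", "product"])   # finer cells
```

Rolling up over fewer dimensions gives the coarser aggregates that OLAP queries navigate between.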
25. **Data Preparation: Sources of Noise** Faulty data-collection instruments (e.g., sensors); transmission errors (e.g., intermittent errors in satellite or internet transmission); data entry errors; technology-limitation errors; misused naming conventions (e.g., the same name with different meanings); incorrect classification.
26. **Data Preparation: Redundant Data** Variables have different names in different databases; a raw variable in one database is a derived variable in another; changes in a variable over time are not reflected in the database. Irrelevant variables destroy speed, so dimension reduction is needed.
27. **Data Preparation: Missing Data** Missing values in massive data sets: the missing data may be irrelevant to the desired result; massive data sets acquired by instrumentation may have few missing values; impute missing data manually or statistically; use a missing-value plot to diagnose; traditional methods are limited to small data sets.
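The statistical imputation mentioned on slide 27 can be as simple as replacing missing entries with the mean of the observed values; a minimal Python sketch (mean imputation is just one of many possible choices, not necessarily the slides' method):

```python
from statistics import mean

def impute_mean(values):
    """Replace None (missing) entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

filled = impute_mean([1.0, None, 3.0])   # -> [1.0, 2.0, 3.0]
```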
28. **Data Preparation: Outliers** Outliers are easy to detect in low dimensions, but a high-dimensional outlier may not show up in low-dimensional projections. Clustering and other statistical modeling can be used; the Fisher information matrix and convex-hull peeling are more feasible but still too complex for massive data sets (outlier detection is generally difficult for massive data). Traditional methods are limited to small data sets.
29. **Data Cleaning** Duplicate removal (tool-based); missing-value imputation (manual and statistical); identify and remove data inconsistencies; identify and refresh stale data; create a unique record (case) ID.
30. **Database Sampling** KDD systems must be able to assist in selecting the appropriate parts of the database to examine. For sampling to work, the data must satisfy certain conditions (e.g., no systematic biases). Sampling can be a very expensive operation, especially when the sample is taken from data stored in a database.
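Slide 30's condition of no systematic bias is what a simple random sample provides, since every record gets the same inclusion probability. A sketch with an arbitrary population and sample size:

```python
import random

random.seed(0)                            # fixed seed for reproducibility
population = list(range(1000))            # stand-in for database record IDs
sample = random.sample(population, 50)    # simple random sample, no replacement
```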
31. **Database Reduction** Data cube aggregation; dimension reduction; elimination of irrelevant and redundant attributes; data compression; encoding mechanisms, quantization, wavelet transformation, principal components, etc.
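One elementary form of the attribute elimination listed on slide 31 is dropping near-constant columns, which carry no information for modeling. The column names and data below are invented for illustration:

```python
from statistics import pvariance

def drop_low_variance(columns, threshold=1e-9):
    """Keep only attributes whose (population) variance exceeds the threshold."""
    return {name: vals for name, vals in columns.items()
            if pvariance(vals) > threshold}

data = {"age": [23, 45, 31], "const_flag": [1, 1, 1]}   # invented example
reduced = drop_low_variance(data)   # 'const_flag' is constant and dropped
```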
32. **Data Preparation Using R** (section divider)
33. **Introduction to R** A software package suitable for data analysis and graphical representation; free and open source; implements many modern statistical methods; flexible and customizable.
34. **Software Download of R** Go to http://www.cran.r-project.org/, click on Windows (95 and later), then on base, then on R-2.3.0-win32.exe; press Save to download the file. Install R by double-clicking the downloaded file and following the instructions.
35. **Using R** To invoke R, go to Start, then Programs, then R. To quit, type q() at the R prompt (>) and press the Enter key, or simply close the window.
36. **Using R, cont.** A good introduction to R is available at http://www.cran.r-project.org/ (click Manuals under Documentation); videos on R: www.decisionsciencenews.com/?p=261. A few important commands: help.start() opens a web-based interface to the help system; help(topic) or ?topic gives help on topic; help.search("pattern") finds help pages containing pattern.
37. **Data Preparation: Data Matrix** (slide content not captured in the transcript)
38. **Data Preparation: Variable Types** (slide content not captured in the transcript)
39. **Data Preparation: Missing Values** (slide content not captured in the transcript)
40. **Data Preparation: Missing Values, cont.** (slide content not captured in the transcript)
41. **Data Preparation** (slide content not captured in the transcript)
42. **Data Preparation** (slide content not captured in the transcript)
43. **Data Preparation: Data Dissimilarities (Distances)** (slide content not captured in the transcript)
44. **Data Preparation: Data Dissimilarities (Distances), cont.** (slide content not captured in the transcript)
45. **Data Preparation: Data Dissimilarities (Distances), cont.** (slide content not captured in the transcript)
46. **Data Preparation: Data Dissimilarities (Distances), cont.** (slide content not captured in the transcript)
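Slides 43-46 cover data dissimilarities whose formulas were not captured in the transcript. As a stand-in (an assumption, since the slides' exact content is unknown), here are the two most common numeric distances:

```python
import math

def euclidean(x, y):
    """Straight-line (L2) distance between two numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """City-block (L1) distance between two numeric vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

d2 = euclidean((0, 0), (3, 4))   # -> 5.0
d1 = manhattan((0, 0), (3, 4))   # -> 7
```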
47. **Data Preparation: Scaling of Mixed-Type Data** (slide content not captured in the transcript)
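Slide 47's method for scaling mixed-type data was not captured. A common approach (an assumption here, not necessarily the slide's) is a Gower-style dissimilarity that range-scales numeric fields and uses simple matching for categorical fields:

```python
def gower(x, y, ranges):
    """Gower-style dissimilarity between two mixed-type records.
    'ranges' gives the observed range of each numeric field, or None
    for a categorical field (simple matching: 0 if equal, 1 otherwise)."""
    total = 0.0
    for a, b, r in zip(x, y, ranges):
        if r is None:                     # categorical field
            total += 0.0 if a == b else 1.0
        else:                             # numeric field, scaled by its range
            total += abs(a - b) / r
    return total / len(x)

# Invented two-field records: one numeric (range 4.0), one categorical.
d = gower((1.0, "red"), (3.0, "blue"), ranges=(4.0, None))   # -> 0.75
```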
48. **Data Preparation: Outlier Detection** Grubbs' test (the z-score rule, non-resistant); the test statistic itself is not captured in the transcript.
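The z-score rule behind Grubbs-style tests on slide 48 can be sketched as follows. The data and threshold are invented; note that in a sample of size n the largest |z| cannot exceed (n-1)/sqrt(n), so the threshold must respect the sample size (the formal Grubbs test takes its critical value from the t distribution, which this sketch omits):

```python
from statistics import mean, stdev

def z_score_outliers(xs, threshold):
    """Flag points whose absolute z-score exceeds the threshold
    (the non-resistant rule behind Grubbs-style tests)."""
    m, s = mean(xs), stdev(xs)
    return [x for x in xs if abs(x - m) / s > threshold]

# With n = 5 the largest possible |z| is 4/sqrt(5) ~ 1.79, so a
# threshold of 1.5 (illustrative only) can still flag the gross outlier.
flagged = z_score_outliers([10, 11, 9, 10, 100], threshold=1.5)
```

The gross outlier inflates the standard deviation it is judged against, which is exactly why slide 49 turns to resistant rules.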
49. **Data Preparation: Outlier Detection, Other Rules** CV rule: CV (coefficient of variation) = SD/mean; call an observation an outlier if the CV exceeds a certain threshold. Resistant rule: resistant score = (X - median)/MAD, where MAD is the median absolute deviation; call an observation an outlier if the score exceeds a certain threshold (say 5).
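The resistant rule from slide 49 in a short Python sketch; the data are invented, and the threshold of 5 is the slide's suggested value:

```python
from statistics import median

def resistant_scores(xs):
    """Resistant score = (x - median) / MAD, per slide 49."""
    med = median(xs)
    mad = median(abs(x - med) for x in xs)   # median absolute deviation
    return [(x - med) / mad for x in xs]

xs = [10, 11, 9, 10, 100]                    # invented data, one gross outlier
scores = resistant_scores(xs)
outliers = [x for x, s in zip(xs, scores) if abs(s) > 5]   # threshold 5
```

Because the median and MAD ignore extreme values, the outlier scores 90 here while the z-score rule above could never push it past 1.79.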
50. **Data Preparation: Mahalanobis Distance** For multi-dimensional data (the formula is not captured in the transcript).
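The Mahalanobis distance of slide 50 measures how far a point lies from the sample mean in units of the sample covariance. A two-variable sketch with the 2x2 covariance inverted by hand (the sample points are invented):

```python
from statistics import mean

def mahalanobis_sq(point, data):
    """Squared Mahalanobis distance of a 2-D point from a sample,
    inverting the 2x2 sample covariance matrix by hand."""
    xs = [p[0] for p in data]
    ys = [p[1] for p in data]
    mx, my = mean(xs), mean(ys)
    n = len(data)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # d^2 = [dx dy] S^{-1} [dx dy]^T, with the 2x2 inverse written out
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

pts = [(0, 0), (1, 0), (0, 1), (1, 1)]   # invented sample
d2 = mahalanobis_sq((1.5, 0.5), pts)
```

For general dimensions one would invert the full covariance matrix (e.g., with a linear-algebra library) rather than expand it by hand.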
51. **Data Preparation: Home Equity Loan Example** (data description not captured in the transcript)
52. **Data Preparation** Step 1: download the data. Step 2: remove cases with missing data. (The R code is not captured in the transcript.)
53. **Data Preparation** Step 3: remove outliers. (3a) Separate the data into two groups; (3b) compute the Mahalanobis distance for each group. (R code not captured in the transcript.)
54. **Data Preparation** (3c) Remove outliers in each of the two groups; (3d) combine the two groups of data and write to a file. (R code not captured in the transcript.)
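The workflow of slides 52-54 (drop incomplete cases, split into two groups, flag outliers per group) can be sketched as below. For brevity this uses the one-variable resistant score from slide 49 in place of the slides' Mahalanobis distance, and the field names ("bad", "loan"), records, and cutoff are all hypothetical:

```python
from statistics import median

def clean_cases(records, fields):
    """Step 2: drop records with missing values in the listed fields."""
    return [r for r in records if all(r.get(f) is not None for f in fields)]

def split_groups(records, key):
    """Step 3a: separate the data into two groups on a 0/1 variable."""
    return ([r for r in records if r[key] == 0],
            [r for r in records if r[key] == 1])

def remove_outliers(records, field, cutoff=5.0):
    """Steps 3b-3c: score each record and drop those beyond the cutoff
    (resistant score here, where the slides use Mahalanobis distance)."""
    vals = [r[field] for r in records]
    med = median(vals)
    mad = median(abs(v - med) for v in vals) or 1.0   # guard against MAD = 0
    return [r for r in records if abs(r[field] - med) / mad <= cutoff]

records = [                                   # invented loan records
    {"bad": 0, "loan": 100}, {"bad": 0, "loan": 110},
    {"bad": 0, "loan": 90},  {"bad": 0, "loan": 10000},
    {"bad": 1, "loan": None},
]
cleaned = clean_cases(records, ["loan"])
good, bad = split_groups(cleaned, "bad")
kept = remove_outliers(good, "loan")          # drops the 10000 record
```

Step 3d would then concatenate the cleaned groups and write them out.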
55. **Data Preparation** Plots of the Mahalanobis distances (not captured in the transcript).