Data pyramid: Data => Information (data + context) => Knowledge (information + rules) => Wisdom (knowledge + experience)
Related Fields: Statistics, Machine Learning, Databases, and Visualization all contribute to Data Mining and Knowledge Discovery
Knowledge Discovery Process
Raw Data => (Integration) => Data Warehouse => (Selection & Cleaning) => Target Data => (Transformation) => Transformed Data => (Data Mining) => Patterns and Rules => (Interpretation & Evaluation) => Knowledge => Understanding
Data Mining and Business Intelligence
Layers, from the top (greatest potential to support business decisions) down:
Making Decisions
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: OLAP, MDA, Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, from the top layer down: End User, Business Analyst, Data Analyst, DBA
Definition of Data Mining
“… The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data…”
Fayyad, Piatetsky-Shapiro, Smyth 
The Evolution of Data Analysis
Data Collection (1960s)
Business question: "What was my total revenue in the last five years?"
Enabling technologies: Computers, tapes, disks
Product providers: IBM, CDC
Characteristics: Retrospective, static data delivery
Data Access (1980s)
Business question: "What were unit sales in New England last March?"
Enabling technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
Product providers: Oracle, Sybase, Informix, IBM, Microsoft
Characteristics: Retrospective, dynamic data delivery at record level
Data Warehousing & Decision Support (1990s)
Business question: "What were unit sales in New England last March? Drill down to Boston."
Enabling technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
Product providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
Characteristics: Retrospective, dynamic data delivery at multiple levels
Data Mining (Emerging Today)
Business question: "What’s likely to happen to Boston unit sales next month? Why?"
Enabling technologies: Advanced algorithms, multiprocessor computers, massive databases
Product providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
Characteristics: Prospective, proactive information delivery
Need for Data Mining
Data accumulate and double every 9 months
There is a big gap from stored data to knowledge; and the transition won’t occur automatically.
Manual data analysis is not new, but it has become a bottleneck
Fast developing Computer Science and Engineering generates new demands
Seeking knowledge from massive data
Any personal experience?
When is DM useful
Data rich world
Large data (dimensionality and size)
Image data (size)
Gene chip data (dimensionality)
Little knowledge about data (exploratory data analysis)
What if we have some knowledge?
Increasing data dimensionality and data size
Various data forms
New data types
Streaming data, multimedia data
Efficient search and access to data/knowledge
Intelligent update and integration
Data Mining Survey
19% Financial Serv.
17% Tele/Data communication
21.4% Understanding Customer Segments and Preferences,
19.5% Identifying Profitable Customers and Acquiring New Ones,
14.1% Increasing Revenue From Customers.
World Data Mining Survey, 6 August, 2002.
Results of Data Mining Include:
Forecasting what may happen in the future
Classifying people or things into groups by recognizing patterns
Clustering people or things into groups based on their attributes
Associating what events are likely to occur together
Sequencing what events are likely to lead to later events
Data Mining versus OLAP
OLAP - On-line Analytical Processing
Provides you with a very good view of what is happening, but cannot predict what will happen in the future or explain why it is happening
Data Mining Versus Statistical Analysis
Statistical analysis:
Tests for statistical correctness of models
Are statistical assumptions of models correct?
E.g., is the R-squared good?
Is the relationship significant?
Use a t-test to validate significance
Tends to rely on sampling
Techniques are not optimised for large amounts of data
Requires strong statistical skills
Data mining:
Originally developed to act as expert systems to solve problems
Less interested in the mechanics of the technique
If it makes sense then let’s use it
Does not require assumptions to be made about data
Can find patterns in very large amounts of data
Requires understanding of data and business problem
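As a rough illustration of the statistical checks named above (R-squared and a t-test on the relationship), the sketch below fits a one-variable least-squares line in plain Python. The data values and the function name simple_regression are made up for illustration. A real analysis would also compare the t statistic against a critical value for n - 2 degrees of freedom.

```python
import math


def simple_regression(xs, ys):
    """Fit y = intercept + slope * x by least squares.

    Returns (slope, r_squared, t_statistic_of_slope).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual and total sums of squares give R-squared.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - sse / sst

    # Standard error of the slope, then the t statistic for "slope = 0".
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope, r_squared, slope / std_err


# Hypothetical measurements, not taken from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, r_squared, t_stat = simple_regression(xs, ys)
print(f"slope={slope:.2f}  R^2={r_squared:.4f}  t={t_stat:.1f}")
```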
Data Mining Taxonomy
- Predictive methods: … predict the value of a particular attribute…
- Descriptive methods: … foundation of human-interpretable patterns that describe the data…
Data Mining Tasks...
Classification [ Predictive ]
Clustering [ Descriptive ]
Association Rule Discovery [ Descriptive ]
Sequential Pattern Discovery [ Descriptive ]
Deviation Detection [ Predictive ]
Data Mining Tasks: Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances
Many approaches: Statistics, Decision Trees, Neural Networks, ...
Classification: Linear Regression
w_0 + w_1 x + w_2 y >= 0
Regression computes the weights w_i from data to minimize squared error to ‘fit’ the data
Not flexible enough
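A minimal sketch of the idea above, assuming the two classes are coded as +1 and -1 and the weights are found by solving the normal equations. The helper names (solve, fit_linear_classifier, predict) and the six sample points are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a * w = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_classifier(points, labels):
    """Least-squares fit of w0 + w1*x + w2*y to labels coded as +1 / -1."""
    rows = [[1.0, x, y] for x, y in points]
    # Normal equations: (A^T A) w = A^T t
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in zip(rows, labels)) for i in range(3)]
    return solve(ata, atb)


def predict(w, x, y):
    """Apply the decision rule w0 + w1*x + w2*y >= 0."""
    return 1 if w[0] + w[1] * x + w[2] * y >= 0 else -1


# Hypothetical, linearly separable sample.
points = [(1, 1), (2, 1), (1, 2), (5, 5), (6, 5), (5, 6)]
labels = [-1, -1, -1, 1, 1, 1]
w = fit_linear_classifier(points, labels)
print([predict(w, x, y) for x, y in points])
```

The resulting boundary is always a straight line, which is why the approach is not flexible enough for classes that are not linearly separable.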
Classification: Decision Trees
[Figure: points plotted on X and Y axes, partitioned by thresholds at X = 5, X = 2, and Y = 3]
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
- a way of representing a series of rules that lead to a class or value;
- basic components of a decision tree: decision nodes, branches and leaves;
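The X/Y rule set from the decision tree slide can be written directly as nested conditionals. Each test is a decision node, each outcome of a test is a branch, and each returned colour is a leaf. The function name classify is an assumption.

```python
def classify(x, y):
    """Return the class predicted by the example rule set."""
    if x > 5:          # decision node 1
        return "blue"
    elif y > 3:        # decision node 2
        return "blue"
    elif x > 2:        # decision node 3
        return "green"
    else:
        return "blue"


for point in [(6, 0), (1, 4), (3, 1), (1, 1)]:
    print(point, classify(*point))
```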
[Figure: example credit-risk decision tree; decision nodes include Job>5 and High Debt, branches are labeled Yes/No, and leaves are Low Risk or High Risk]
Decision Trees (cont.)
handle non-numeric data very well;
work best when the predictor variables are categorical;
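Example Decision Tree
Attribute types: Refund (categorical), MarSt (categorical), TaxInc (continuous); the class is YES / NO
Splitting Attributes:
Refund = Yes: NO
Refund = No: test MarSt
MarSt = Married: NO
MarSt = Single, Divorced: test TaxInc
TaxInc < 80K: NO
TaxInc > 80K: YES
The splitting attribute at a node is determined based on the Gini index.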
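A short sketch of how the Gini index can be used to score a candidate split: the impurity of each child node is weighted by its share of the records, and a lower weighted value means a purer split. The class counts below are illustrative and are not taken from the slide's training table. The function names are assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini index of one node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(groups):
    """Weighted Gini index of a split into several child nodes."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)


# Illustrative class labels in the two children of a split on Refund.
refund_yes = ["NO", "NO", "NO"]
refund_no = ["NO", "NO", "NO", "NO", "YES", "YES", "YES"]
print(round(gini_split([refund_yes, refund_no]), 4))
```

To choose a splitting attribute, the same score would be computed for every candidate attribute and the lowest one kept.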
Classification: Neural Networks
efficiently model large and complex problems;
may be used in classification problems or for regressions;
Starts with input layer => hidden layer => output layer
[Figure: feed-forward network with nodes numbered 1 to 6, arranged as Inputs, Hidden Layer, and Output]
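The flow from input layer to hidden layer to output layer can be sketched as a single forward pass. The layout (three inputs, two hidden nodes, one output), the sigmoid activation, and all weight values are assumptions chosen for illustration. Training the weights is not shown.

```python
import math


def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, hidden_weights, output_weights):
    """Propagate inputs through one hidden layer to a single output."""
    # Each hidden node takes a weighted sum of all inputs.
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)))
        for weights in hidden_weights
    ]
    # The output node takes a weighted sum of all hidden activations.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))


# Hypothetical weights: one row per hidden node, one value per input.
hidden_weights = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]]
output_weights = [0.6, -0.9]
print(forward([1.0, 0.5, -1.0], hidden_weights, output_weights))
```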
Neural Networks (cont.)
can be easily implemented to run on massively parallel computers;
cannot be easily interpreted;
require an extensive amount of training time;
require a lot of data preparation (involve very careful data cleansing, selection, preparation, and pre-processing);
require a sufficiently large data set and a high signal-to-noise ratio.
seeks to describe dataset in terms of natural clusters of cases
Classification Example
[Figure: a training set with two categorical attributes, one continuous attribute, and a class column is used to learn a classifier; the resulting model is then applied to a test set]
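The train-then-test workflow in this example can be sketched as follows. A 1-nearest-neighbour rule stands in for the classifier purely to keep the sketch short. The slides do not prescribe it, and the numeric records and function names are hypothetical.

```python
def learn_classifier(training_set):
    """Return a model that predicts the label of the closest training record."""
    def model(features):
        nearest = min(
            training_set,
            key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], features)),
        )
        return nearest[1]
    return model


def accuracy(model, test_set):
    """Fraction of held-out records whose predicted label is correct."""
    return sum(model(f) == label for f, label in test_set) / len(test_set)


# Hypothetical (features, class) records.
training_set = [((1.0, 1.0), "no"), ((1.5, 2.0), "no"),
                ((5.0, 5.0), "yes"), ((6.0, 5.5), "yes")]
test_set = [((1.2, 1.1), "no"), ((5.5, 5.0), "yes"), ((2.0, 1.5), "no")]

model = learn_classifier(training_set)
print(accuracy(model, test_set))
```

Keeping the test set separate from the training set gives an estimate of how the model behaves on records it has not seen.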
Sky Survey Cataloging
Data Mining Tasks: Clustering
Goal is to identify categories
Natural grouping of customers by processing all the available data about them.
market segmentation, discovering affinity groups, and defect analysis
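One common way to find such natural groupings is k-means, shown here as a bare-bones sketch. The slides do not name a specific algorithm, so this is a stand-in. The customer records (age, yearly spend), the starting centers, and the fixed iteration count are all assumptions.

```python
def kmeans(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical customers: (age, yearly spend).
customers = [(20, 100), (22, 120), (21, 110), (50, 900), (52, 950), (51, 920)]
centers, clusters = kmeans(customers, [customers[0], customers[3]])
print(centers)
print(clusters)
```

In practice the attributes would be scaled first so that no single attribute dominates the distance.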
Data Mining Tasks: Association Rule Discovery
Given a set of records, each of which contains some number of items from a given collection;
Produce dependency rules which will predict occurrence of an item based on occurrences of other items.
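Dependency rules of this kind are usually scored by support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both for one rule. The market-basket records and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical records.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
print(support(transactions, {"diapers", "beer"}))
print(confidence(transactions, {"diapers"}, {"beer"}))
```

A rule-discovery algorithm would enumerate candidate itemsets and keep only rules whose support and confidence exceed chosen thresholds.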
Deviation Detection:
… discovering most significant changes in data from previously measured or normative values…
V. Kumar, M. Joshi, Tutorial on High Performance Data Mining.
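A simple way to flag changes from previously measured values is a z-score test against a baseline, sketched below. The three-standard-deviation threshold, the readings, and the function name are assumptions. The tutorial quoted above does not specify a method.

```python
from statistics import mean, stdev


def deviations(baseline, new_values, threshold=3.0):
    """Return the new values that lie more than `threshold` sample
    standard deviations away from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [v for v in new_values if abs(v - mu) > threshold * sigma]


# Hypothetical readings.
baseline = [100, 102, 98, 101, 99]
print(deviations(baseline, [101, 97, 130]))
```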
Sequential Pattern Discovery:
… process of looking for patterns and rules that predict strong sequential dependencies among different events…
V. Kumar, M. Joshi, Tutorial on High Performance Data Mining.
Identify frequently occurring sequences from given records
40 percent of female customers buy a gray skirt six months after buying a red jacket
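A statement like the one above can be checked by counting, among customers who bought the first item, how many bought the second item later. This sketch only looks at purchase order and ignores the length of the time gap. The purchase histories and the function name are hypothetical.

```python
def sequence_support(histories, first, later):
    """Among customers who bought `first`, the fraction who bought
    `later` at some point after their first purchase of `first`."""
    bought_first = 0
    followed = 0
    for events in histories:
        if first in events:
            bought_first += 1
            if later in events[events.index(first) + 1:]:
                followed += 1
    return followed / bought_first if bought_first else 0.0


# Hypothetical purchase histories, oldest purchase first.
histories = [
    ["red jacket", "scarf", "gray skirt"],
    ["gray skirt", "red jacket"],
    ["red jacket", "gray skirt"],
    ["scarf"],
    ["red jacket", "hat"],
]
print(sequence_support(histories, "red jacket", "gray skirt"))
```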
Data Mining Methodology: SAS
Extract a portion of the dataset for data mining
Create, select and transform variables with the intention of building a model
Specify a relationship of variables that reliably predicts a desired goal
Evaluate the practical value of the findings and the model resulting from the data mining effort
Data Mining Methodology: CRISP-DM
Phases and Tasks
Business Understanding
- Determine Business Objectives: Background, Business Objectives, Business Success Criteria
- Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
- Determine Data Mining Goal: Data Mining Goals, Data Mining Success Criteria
- Produce Project Plan: Project Plan, Initial Assessment of Tools and Techniques
Data Understanding
- Collect Initial Data: Initial Data Collection Report
- Describe Data: Data Description Report
- Explore Data: Data Exploration Report
- Verify Data Quality: Data Quality Report
Data Preparation
- Data Set: Data Set Description
- Select Data: Rationale for Inclusion / Exclusion
- Clean Data: Data Cleaning Report
- Construct Data: Derived Attributes, Generated Records
- Integrate Data: Merged Data
- Format Data: Reformatted Data
Modeling
- Select Modeling Technique: Modeling Technique, Modeling Assumptions
- Generate Test Design: Test Design
- Build Model: Parameter Settings, Models, Model Description
- Assess Model: Model Assessment, Revised Parameter Settings
Evaluation
- Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria, Approved Models
- Review Process: Review of Process
- Determine Next Steps: List of Possible Actions, Decision
Deployment
- Plan Deployment: Deployment Plan
- Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
- Produce Final Report: Final Report, Final Presentation
- Review Project: Experience Documentation
Major Application Areas for Data Mining Solutions
Fraud/Non-Compliance Anomaly detection
Isolate the factors that lead to fraud, waste and abuse
Target auditing and investigative efforts more effectively