visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
Data Mining and Business Intelligence
[Figure: pyramid of increasing potential to support business decisions, bottom to top:
Data Sources: paper, files, information providers, database systems, OLTP
Data Warehouses / Data Marts (DBA)
Data Exploration: OLAP, MDA, statistical analysis, querying and reporting (Data Analyst)
Data Mining: information discovery (Data Analyst)
Data Presentation: visualization techniques (Business Analyst)
Making Decisions (End User)]
Data Warehouses: Data Cube and OLAP
[Figure: a cube dimension with the location hierarchy all → region (Europe, North_America) → country (Germany, Spain, Canada, Mexico) → city (Frankfurt, Vancouver, Toronto) → office (e.g., M. Wind, L. Chan).]
Data Warehouses: Data Cube and OLAP
Dimensions: Product, Region, Month, with hierarchical summarization paths:
Product: industry → category → product
Region: region → country → city → office
Month: year → quarter → month / week → day
Data Warehouses: Data Cube and OLAP
[Figure: a 3-D sales cube with dimensions Date (1Qtr–4Qtr), Product (TV, VCR, PC), and Country (U.S.A, Canada, Mexico); summing out a dimension yields marginal totals, e.g., the total annual sales of TV in the U.S.A., up to the apex cell (All, All, All).]
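The roll-up in the figure can be sketched in plain Python: a minimal cube over a hypothetical (product, country, quarter) fact table, where summing out a dimension replaces it with "All" (all names and sales figures here are invented for illustration).

```python
from collections import defaultdict

# Hypothetical fact table: (product, country, quarter) -> sales
facts = {
    ("TV", "U.S.A", "1Qtr"): 100, ("TV", "U.S.A", "2Qtr"): 120,
    ("TV", "Canada", "1Qtr"): 40,  ("VCR", "U.S.A", "1Qtr"): 70,
}

def roll_up(facts, keep):
    """Aggregate the cube: dimensions with keep=False are summed out
    and replaced by the special value 'All'."""
    cell = defaultdict(int)
    for dims, measure in facts.items():
        key = tuple(d if k else "All" for d, k in zip(dims, keep))
        cell[key] += measure
    return dict(cell)

# Total annual sales of TV in U.S.A.: keep product and country, sum out time
by_prod_country = roll_up(facts, (True, True, False))
print(by_prod_country[("TV", "U.S.A", "All")])  # 220
```

Rolling up with `keep=(False, False, False)` produces the apex cell (All, All, All), the grand total over the whole fact table.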
Concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Association (correlation and causality)
Multi-dimensional vs. single-dimensional association
Interestingness measures: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures:
Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.
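The objective measures named above, support and confidence, can be computed directly from the transaction data. A minimal sketch over a hypothetical transaction list (item names invented for illustration):

```python
# Hypothetical market-basket transactions
transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"},
    {"bread"}, {"milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """confidence(lhs -> rhs) = support(lhs union rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"milk", "bread"}, transactions))      # 0.5
print(confidence({"milk"}, {"bread"}, transactions))  # 2/3
```

Support measures how often the pattern occurs; confidence measures how often the rule's consequent holds when its antecedent does.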
Can We Find All and Only Interesting Patterns?
Find all the interesting patterns: Completeness
Can a data mining system find all the interesting patterns?
Association vs. classification vs. clustering
Search for only interesting patterns: Optimization
Can a data mining system find only the interesting patterns?
First generate all the patterns and then filter out the uninteresting ones.
Generate only the interesting patterns — mining query optimization
Data Preprocessing
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!
Quality decisions must be based on quality data
Data warehouse needs consistent integration of quality data
Major Tasks in Data Preprocessing
Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration: integration of multiple databases, data cubes, or files
Data transformation: normalization and aggregation
Data reduction: obtains a reduced representation in volume but produces the same or similar analytical results
Data discretization: part of data reduction but with particular importance, especially for numerical data
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
Predict the missing value: use the most probable value vs. the value with the least impact on the further analysis
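The fill-in strategies above can be sketched in a few lines: a global attribute mean versus the smarter per-class mean, over a hypothetical (class label, income) sample where `None` marks a missing value.

```python
from statistics import mean

# Hypothetical samples: (class label, income); None = missing
data = [("yes", 30), ("yes", None), ("no", 50), ("no", 40), ("yes", 34)]

# Strategy: fill with the global attribute mean
known = [x for _, x in data if x is not None]
global_mean = mean(known)

# Smarter strategy: fill with the mean of samples in the same class
def class_mean(label):
    vals = [x for c, x in data if c == label and x is not None]
    return mean(vals)

filled = [(c, x if x is not None else class_mean(c)) for c, x in data]
print(global_mean)  # 38.5
print(filled[1])    # ('yes', 32.0)
```

The per-class mean (32.0 for class "yes") stays closer to the tuple's own class distribution than the global mean (38.5), which is why it is the smarter default when a class label is available.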
How to Handle Noisy Data?
Binning method:
first sort data and partition into (equi-depth) bins
then smooth by bin means, bin medians, bin boundaries, etc.
Clustering:
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human
Regression:
smooth by fitting the data into regression functions
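Equi-depth binning with smoothing by bin means and by bin boundaries can be sketched as follows (toy price list and bin depth chosen for illustration):

```python
# Sorted toy prices, partitioned into equi-depth bins of 4 values each
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
depth = 4
bins = [sorted(prices)[i:i + depth] for i in range(0, len(prices), depth)]

# Smooth by bin means: every value is replaced by its bin's mean
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smooth by bin boundaries: every value snaps to the nearer of the
# bin's min/max (ties go to the lower boundary)
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
             for b in bins]

print(by_means[0])   # [9.0, 9.0, 9.0, 9.0]
print(by_bounds[0])  # [4, 4, 4, 15]
```

Equi-depth bins hold the same number of values, so the smoothing effect per value is comparable across bins regardless of how the values are distributed.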
Data integration: combines data from multiple sources into a coherent store
Schema integration: integrate metadata from different sources
Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
for the same real world entity, attribute values from different sources are different
possible reasons: different representations, different scales, e.g., metric vs. British units
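A data value conflict of exactly this kind can be resolved at integration time by converting to a common scale. A minimal sketch, assuming hypothetical source schemas where source A stores weight in kilograms and source B in pounds:

```python
# Hypothetical records for the same real-world entity from two sources
LB_PER_KG = 2.20462

source_a = {"cust-id": 1, "weight_kg": 70.0}
source_b = {"cust-#": 1, "weight_lb": 154.3}

def to_kg(record):
    """Normalize the weight attribute to metric units."""
    if "weight_kg" in record:
        return record["weight_kg"]
    return record["weight_lb"] / LB_PER_KG

# After conversion, the two records agree on the shared attribute
print(round(to_kg(source_a), 1), round(to_kg(source_b), 1))  # 70.0 70.0
```

Once both sources report the attribute on one scale, the entity identification step (A.cust-id vs. B.cust-#) can match records without spurious value conflicts.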
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scaled to fall within a small, specified range
Scales: nominal, ordinal, and interval scales
normalization by decimal scaling
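A minimal sketch of normalization to a small, specified range (toy values; min-max and z-score normalization are shown alongside decimal scaling for context, as they are the other standard variants):

```python
from statistics import mean, pstdev

values = [200, 300, 400, 600, 1000]

def min_max(v, lo=0.0, hi=1.0):
    """Linearly rescale v so the attribute spans [lo, hi]."""
    vmin, vmax = min(values), max(values)
    return lo + (v - vmin) * (hi - lo) / (vmax - vmin)

def z_score(v):
    """Center on the mean and scale by the standard deviation."""
    return (v - mean(values)) / pstdev(values)

def decimal_scaling(v):
    """Divide by 10^j for the smallest j with max(|v'|) < 1."""
    j = len(str(max(abs(x) for x in values)))
    return v / 10 ** j

print(min_max(400))           # 0.25
print(decimal_scaling(1000))  # 0.1
```

Min-max preserves the exact range bounds, z-score is robust when the true min/max are unknown, and decimal scaling only shifts the decimal point.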
New attributes constructed from the given ones
Data Reduction Strategies
Warehouse may store terabytes of data: Complex data analysis/mining may take a very long time to run on the complete data set
Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
Data reduction strategies
Data cube aggregation
Discretization and concept hierarchy generation
Mining Association Rules in Large Databases
Association rule mining
Mining single-dimensional Boolean association rules from transactional databases
Mining multilevel association rules from transactional databases
Mining multidimensional association rules from transactional databases and data warehouse
From association mining to correlation analysis
Constraint-based association mining
What Is Association Mining?
Association rule mining:
Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
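The core of association mining, finding frequent itemsets, can be sketched as a brute-force search with the Apriori stopping rule (no frequent k-itemset implies no frequent (k+1)-itemset); basket contents and the support threshold here are hypothetical.

```python
from itertools import combinations

# Hypothetical Boolean basket data
baskets = [
    {"beer", "diaper", "milk"}, {"beer", "diaper"},
    {"beer", "milk"}, {"diaper", "milk"},
]
min_support = 0.5
items = sorted(set().union(*baskets))

frequent = {}
for k in range(1, len(items) + 1):
    found = False
    for cand in combinations(items, k):
        sup = sum(set(cand) <= b for b in baskets) / len(baskets)
        if sup >= min_support:
            frequent[cand] = sup
            found = True
    if not found:
        break  # Apriori monotonicity: no larger itemset can be frequent

print(frequent[("beer", "diaper")])  # 0.5
```

Real miners (Apriori, FP-growth) avoid enumerating all candidates by pruning with this same monotonicity property; the sketch only shows the definition at work on toy data.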