Introduction to Data Mining
Michael R. Wick, Professor and Chair
Department of Computer Science, University of Wisconsin – Eau Claire, Eau Claire, WI 54701
[email_address] | 715-836-2526
Acknowledgements Some of the material used in this talk is drawn from: Dr. Jiawei Han at University of Illinois at Urbana Champaign Dr. Bhavani Thuraisingham (MITRE Corp. and UT Dallas) Dr. Chris Clifton, Indiana Center for Database Systems, Purdue University
Road Map Definition and Need Applications Process Types  Example: The Apriori Algorithm State of Practice Related Techniques Data Preprocessing
What Is Data Mining? Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data. Data mining: a misnomer. Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. Watch out: Is everything "data mining"? (Deductive) query processing, expert systems, and small learning programs are not.
What is Data Mining? Real Example from the NBA Play-by-play information recorded by teams Who is on the court Who shoots Results Coaches want to know what works best Plays that work well against a given team Good/bad player matchups Advanced Scout  (from IBM Research) is a data mining tool to answer these questions http://www.nba.com/news_feat/beyond/0126.html Starks+Houston+Ward playing
Necessity for Data Mining. Large amounts of current and historical data are being stored, yet only a small portion (~5-10%) of the collected data is ever analyzed; data that may never be analyzed is collected for fear that something that may prove important will be missed. As databases grow larger, making decisions directly from the raw data becomes impractical; knowledge must be derived from the stored data. Data sources: health-related services (e.g., benefits, medical analyses); commercial (e.g., marketing and sales); financial; scientific (e.g., NASA, Genome); DOD and intelligence. Desired analyses: support for planning (historical supply and demand trends); yield management (scanning airline seat reservation data to maximize yield per seat); system performance (detect abnormal behavior in a system); mature database analysis (clean up the data sources).
Necessity Is the Mother of Invention Data explosion problem  Automated data collection tools and mature database technology lead to tremendous amounts of data accumulated and/or to be analyzed in databases, data warehouses, and other information repositories  We are drowning in data, but starving for knowledge!   Solution: Data warehousing and data mining Data warehousing and on-line analytical processing Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
Potential Applications Data analysis and decision support Market analysis and management Target marketing, customer relationship management (CRM),  market basket analysis, cross selling, market segmentation Risk analysis and management Forecasting, customer retention, improved underwriting, quality control, competitive analysis Fraud detection   Finding outliers in credit card purchases Other Applications Text mining (news group, email, documents) and Web mining Stream data mining DNA and bio-data analysis
Data Mining: Confluence of Multiple Disciplines. [Diagram: Data Mining at the center, drawing on Database Systems, Statistics, Machine Learning, Algorithms, Visualization, and Other Disciplines.]
Knowledge Discovery in Databases: Process (adapted from: U. Fayyad et al. (1995), "From Knowledge Discovery to Data Mining: An Overview," Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press). [Diagram: Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Data Mining) → Patterns → (Interpretation/Evaluation) → Knowledge.]
Steps of a KDD Process: learning the application domain (relevant prior knowledge and goals of the application); creating a target data set: data selection; data cleaning (may take 60% of the effort!); data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation; choosing the data mining method: summarization, classification, regression, association, clustering; choosing the mining algorithm(s); data mining: search for patterns of interest; pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.; use of discovered knowledge.
Data Mining and Business Intelligence: increasing potential to support business decisions as one moves up the stack. [Diagram, bottom to top: Data Sources (paper, files, information providers, database systems, OLTP) → Data Warehouses / Data Marts → Data Exploration (OLAP, MDA, statistical analysis, querying and reporting) → Data Mining (information discovery) → Data Presentation (visualization techniques) → Making Decisions; the corresponding users, bottom to top: DBA → Data Analyst → Business Analyst → End User.]
Multiple Perspectives in Data Mining Data to be mined Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW Knowledge to be mined Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc. Multiple/integrated functions and mining at multiple levels Techniques utilized Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc. Applications adapted Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, Web mining, etc.
Ingredients of an Effective KDD Process. [Diagram: database(s) plus a knowledge base of background knowledge and goals for learning feed a cycle: plan for learning → generate and test hypotheses → discover knowledge → determine knowledge relevancy → evolve knowledge/data, supported by discovery algorithms and by visualization and human-computer interaction.] "In order to discover anything, you must be looking for something." (Murphy's 1st Law of Serendipity)
What Can Data Mining Do? Clustering Identify previously unknown groups Classification Give operational definitions to categories Association Find Association rules Many others…
Clustering Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping a set of data objects into clusters Clustering is unsupervised classification: no predefined classes Typical applications As a  stand-alone tool  to get insight into data distribution  As a  preprocessing step  for other algorithms
Some Clustering Approaches. Iterative distance-based clustering: specify in advance the number of desired clusters (k); k random points are chosen as cluster centers; instances are assigned to the closest center; the centroid (mean) of all points in each cluster is then recomputed; repeat until the clusters are stable (a minimal sketch follows below). Incremental clustering: uses a tree to represent clusters; nodes represent clusters (or subclusters); instances are added one by one and the tree is updated; updating can involve simple placement of the instance in a cluster or re-clustering, and can result in merging or splitting of existing clusters; uses a category utility function to determine how well an instance fits each cluster. Category utility: uses a quadratic loss function over conditional probabilities; asks whether the addition of the new instance helps us better predict the attribute values of other instances.
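A minimal sketch of the iterative distance-based approach (k-means), assuming two-dimensional numeric data; the function and variable names are illustrative, not from the talk:

```python
# k-means: choose k random centers, assign instances to the closest center,
# recompute centroids, and repeat until the clusters are stable.
import math
import random

def kmeans(points, k, max_iters=100):
    centers = random.sample(points, k)           # k random points as centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for p in points:                         # assign to closest center
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        new_centers = [                          # centroid (mean) of each cluster
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:               # stable: stop
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, clusters = kmeans(points, k=2)
print(centers)
```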
General Applications of Clustering  Pattern Recognition Spatial Data Analysis  create thematic maps in GIS by clustering feature spaces detect spatial clusters and explain them in spatial data mining Image Processing Economic Science (especially market research) WWW Document classification Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications Marketing:  Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs Land use:  Identification of areas of similar land use in an earth observation database Insurance:  Identifying groups of motor insurance policy holders with a high average claim cost City-planning:  Identifying groups of houses according to their house type, value, and geographical location Earth-quake studies:  Observed earth quake epicenters should be clustered along continent faults
Classification (vs Prediction) Classification:   predicts categorical class labels (discrete/nominal) classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data Learns operational definition Prediction:   models continuous-valued functions, i.e., predicts unknown or missing values  Typical Applications credit approval target marketing medical diagnosis treatment effectiveness analysis
Classification—A Two-Step Process. Model construction: describing a set of predetermined classes. Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute; the set of tuples used for model construction is the training set; the model is represented as classification rules, decision trees, or mathematical formulas. Model usage: classifying future or unknown objects. First estimate the accuracy of the model: the known label of each test sample is compared with the model's classification; the accuracy rate is the percentage of test set samples correctly classified by the model; the test set must be independent of the training set, otherwise over-fitting will occur. If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
Classification Process (1): Model Construction. [Diagram: Training Data → Classification Algorithms → Classifier (Model), e.g., the learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.]
Classification Process (2): Use the Model in Prediction. [Diagram: Testing Data → Classifier to estimate accuracy, then Unseen Data, e.g., (Jeff, Professor, 4) → Classifier → Tenured?]
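To make the two steps concrete, here is a minimal sketch that applies the rule learned above to a test set and then to unseen data; the rule and the tuple (Jeff, Professor, 4) come from the slides, while the test tuples and helper names are hypothetical:

```python
# Step 1 produced the model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
def classify(rank, years):
    return 'yes' if rank == 'professor' or years > 6 else 'no'

# Step 2a: estimate accuracy on a labeled test set (independent of training).
test_set = [('assistant prof', 3, 'no'),
            ('associate prof', 7, 'yes'),
            ('professor', 5, 'yes')]
correct = sum(classify(rank, years) == label for rank, years, label in test_set)
print(f"accuracy: {correct / len(test_set):.0%}")   # 100% on this toy test set

# Step 2b: if the accuracy is acceptable, classify genuinely unseen tuples.
print(classify('professor', 4))                      # Jeff -> 'yes' (tenured)
```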
Classification Approaches. Divide and conquer: results in a decision tree; uses the "information gain" function to choose splits (see the sketch below). Covering: select a category for which to learn a rule, then add conditions to the rule until it is "good enough".
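A small illustration of the "information gain" function used by divide-and-conquer tree learners: gain is the entropy of the parent set minus the weighted entropy of the partitions induced by an attribute. The toy data and function names are illustrative:

```python
# Information gain = entropy(parent) - weighted sum of entropy(children).
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, attr_index, label_index=-1):
    labels = [r[label_index] for r in rows]
    partitions = {}                       # class labels grouped by attribute value
    for r in rows:
        partitions.setdefault(r[attr_index], []).append(r[label_index])
    weighted = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return entropy(labels) - weighted

# Toy tuples: (rank, tenured) -- illustrative only.
rows = [('professor', 'yes'), ('assistant', 'no'),
        ('professor', 'yes'), ('assistant', 'yes')]
print(information_gain(rows, attr_index=0))   # ~0.31 bits gained by splitting on rank
```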
Association Association rule mining: Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories. Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93] Motivation: finding regularities in data What products were often purchased together? — Beer and diapers?! What are the subsequent purchases after buying a PC? What kinds of DNA are sensitive to this new drug? Can we automatically classify web documents?
Why Is Association Mining Important? Foundation for many essential data mining tasks Association, correlation, causality Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression) Broad applications Basket data analysis, cross-marketing, catalog design, sale campaign analysis Web log (click stream) analysis, DNA sequence analysis, etc.
Basic Concepts: Association Rules. Itemset X = {x1, …, xk}. Find all rules X ⇒ Y with minimum confidence and support, where support, s, is the probability that a transaction contains X ∪ Y, and confidence, c, is the conditional probability that a transaction containing X also contains Y.

Transaction-id | Items bought
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F

Let min_support = 50% and min_conf = 50%. Then: A ⇒ C (50%, 66.7%) and C ⇒ A (50%, 100%). [Venn diagram: customers who buy beer, customers who buy diapers, and the overlap of customers who buy both.]
Mining Association Rules: Example (min. support 50%, min. confidence 50%).

Transaction-id | Items bought
10 | A, B, C
20 | A, C
30 | A, D
40 | B, E, F

Frequent pattern | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A, C} | 50%

For rule A ⇒ C: support = support({A} ∪ {C}) = 50%; confidence = support({A} ∪ {C}) / support({A}) = 66.7%.
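A minimal sketch of computing support and confidence over the four transactions above (the helper names are illustrative):

```python
# Support and confidence for candidate rules over the slide's transactions.
transactions = [{'A', 'B', 'C'}, {'A', 'C'}, {'A', 'D'}, {'B', 'E', 'F'}]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    # Conditional probability that a transaction with X also contains Y.
    return support(X | Y) / support(X)

print(support({'A', 'C'}))       # 0.5   -> A => C has 50% support
print(confidence({'A'}, {'C'}))  # 0.667 -> A => C has 66.7% confidence
print(confidence({'C'}, {'A'}))  # 1.0   -> C => A has 100% confidence
```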
Apriori: A Candidate Generation-and-test Approach Any subset of a frequent itemset must be frequent if  {beer, diaper, nuts}  is frequent, so is  {beer, diaper} Every transaction having {beer, diaper, nuts} also contains {beer, diaper}  Apriori pruning principle : If there is any itemset which is infrequent, its superset should not be generated/tested! Method:  generate length (k+1) candidate itemsets from length k frequent itemsets, and test the candidates against DB Performance studies show its efficiency and scalability
The Apriori Algorithm—A Mathematical Definition.
Let I = {a, b, c, …} be the set of all items in the domain.
Let T = { S | S ⊆ I } be the set of all transaction records (itemsets).
Let support(S) = |{ A | A ∈ T ∧ S ⊆ A }|.
Let L1 = { {a} | a ∈ I ∧ support({a}) ≥ minSupport }.
For all k > 1 with Lk−1 ≠ ∅, let
Lk = { Si ∪ Sj | Si ∈ Lk−1 ∧ Sj ∈ Lk−1 ∧ |Si − Sj| = 1 ∧ |Sj − Si| = 1 ∧ (∀S: (S ⊂ Si ∪ Sj ∧ |S| = k−1) → S ∈ Lk−1) ∧ support(Si ∪ Sj) ≥ minSupport }.
Then the set of all frequent itemsets is given by L = ⋃k Lk, and the set of all association rules is given by
R = { A ⇒ C | F ∈ L ∧ A ∈ 𝒫(F) ∧ C = F − A ∧ A ≠ ∅ ∧ C ≠ ∅ ∧ support(F) / support(A) ≥ minConfidence }.
The Apriori Algorithm—An Example (minSupport = 2).
I = {Table Saw, Router, Kreg Jig, Sander, Drill Press}
T = { {Table Saw, Router, Drill Press}, {Router, Sander}, {Router, Kreg Jig}, {Table Saw, Router, Sander}, {Table Saw, Kreg Jig}, {Router, Kreg Jig}, {Table Saw, Kreg Jig}, {Table Saw, Router, Kreg Jig, Drill Press}, {Table Saw, Router, Kreg Jig} }
L1 = { {T}, {R}, {K}, {S}, {D} }
L2 = { {R,T}, {K,T}, {D,T}, {K,R}, {R,S}, {D,R} }
L3 = { {K,R,T}, {D,R,T} }
L4 = ∅
Rules = ????
The Apriori Algorithm. Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ⋃k Lk;
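A minimal runnable sketch of this pseudo-code, applied to the Table Saw/Router transactions from the example slide above (the helper names are mine, not the talk's):

```python
# Apriori: repeatedly generate length-(k+1) candidates from length-k frequent
# itemsets, prune by the Apriori principle, and count against the database.
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]

    def support(s):
        return sum(s <= t for t in transactions)

    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent, k = set(Lk), 1
    while Lk:
        # Self-join: unions of two k-itemsets that differ in exactly one item.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        Lk = {c for c in candidates if support(c) >= min_support}
        frequent |= Lk
        k += 1
    return frequent

T = [{'T', 'R', 'D'}, {'R', 'S'}, {'R', 'K'}, {'T', 'R', 'S'}, {'T', 'K'},
     {'R', 'K'}, {'T', 'K'}, {'T', 'R', 'K', 'D'}, {'T', 'R', 'K'}]
for s in sorted(apriori(T, min_support=2), key=len):
    print(sorted(s))     # reproduces L1 through L3 from the example slide
```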
Important Details of Apriori How to generate candidates? Step 1: self-joining  L k Step 2: pruning How to count supports of candidates? Example of Candidate-generation L 3 = { abc, abd, acd, ace, bcd } Self-joining:  L 3 *L 3 abcd  from  abc  and  abd acde  from  acd  and  ace Pruning: acde  is removed because  ade  is not in  L 3 C 4 ={ abcd }
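The candidate-generation step in isolation, reproducing the slide's example (a sketch; items are single characters here):

```python
# Self-join L3 with itself, then prune candidates with an infrequent 3-subset.
from itertools import combinations

L3 = {frozenset(s) for s in ['abc', 'abd', 'acd', 'ace', 'bcd']}

# Step 1: self-join -- unions of two 3-itemsets sharing two items.
# (This loose join can also create extras such as abce; pruning removes them.)
joined = {a | b for a in L3 for b in L3 if len(a | b) == 4}

# Step 2: prune -- every 3-subset of a candidate must itself be in L3.
C4 = {c for c in joined if all(frozenset(s) in L3 for s in combinations(c, 3))}
print([''.join(sorted(c)) for c in C4])   # ['abcd']; acde pruned (ade not in L3)
```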
State of Commercial/Research Practice Increasing use of data mining systems in financial community, marketing sectors, retailing Still have major problems with large, dynamic sets of data (need better integration with the databases) Off-the-shelf data mining packages perform specialized learning on small subset of data Most research emphasizes machine learning; little emphasis on database side (especially text) People achieving results  are not likely to share knowledge
Related Techniques: OLAP (On-Line Analytical Processing). OLAP tools provide the ability to pose statistical and summary queries interactively; traditional On-Line Transaction Processing (OLTP) databases may take minutes or even hours to answer these queries. Advantages relative to data mining: can obtain a wider variety of results; generally faster to obtain results. Disadvantages relative to data mining: the user must "ask the right question"; generally used to determine high-level statistical summaries rather than specific relationships among instances.
Integration of Data Mining and Data Warehousing Data mining systems, DBMS, Data warehouse systems coupling No coupling, loose-coupling, semi-tight-coupling, tight-coupling On-line analytical mining data integration of mining and OLAP technologies Interactive mining multi-level knowledge Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc. Integration of multiple mining functions Characterized classification, first clustering and then association
Why Data Preprocessing? Data in the real world is dirty incomplete : lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation=“” noisy : containing errors or outliers e.g., Salary=“-10” inconsistent : containing discrepancies in codes or names e.g., Age=“42” Birthday=“03/07/1997” e.g., Was rating “1,2,3”, now rating “A, B, C” e.g., discrepancy between duplicate records
Why Is Data Dirty? Incomplete data comes from: n/a data values when collected; differing considerations between the time the data was collected and the time it is analyzed; human/hardware/software problems. Noisy data comes from the processes of data collection, entry, and transmission. Inconsistent data comes from: different data sources; functional dependency violations.
Why Is Data Preprocessing Important? No quality data, no quality mining results! Quality decisions must be based on quality data e.g., duplicate or missing data may cause incorrect or even misleading statistics. Data warehouse needs consistent integration of quality data Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse. —Bill Inmon (father of the data warehouse)
Major Tasks in Data Preprocessing Data cleaning Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies Data integration Integration of multiple databases, data cubes, or files Data transformation Normalization and aggregation Data reduction Obtains reduced representation in volume but produces the same or similar analytical results Data discretization Part of data reduction but with particular importance, especially for numerical data
Forms of data preprocessing
Data Cleaning Importance “ Data cleaning is one of the three biggest problems in data warehousing”—Ralph Kimball “ Data cleaning is the number one problem in data warehousing”—DCI survey Data cleaning tasks Fill in missing values Identify outliers and smooth out noisy data  Correct inconsistent data Resolve redundancy caused by data integration
Missing Data. Data is not always available; e.g., many tuples have no recorded value for several attributes, such as customer income in sales data. Missing data may be due to: equipment malfunction; data inconsistent with other recorded data and thus deleted; data not entered due to misunderstanding; certain data not being considered important at the time of entry; failure to register the history or changes of the data. Missing data may need to be inferred.
How to Handle Missing Data? Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably. Fill in the missing value manually: tedious + infeasible? Fill it in automatically with: a global constant, e.g., "unknown" (a new class?!); the attribute mean; the attribute mean for all samples belonging to the same class (smarter); or the most probable value (inference-based, such as a Bayesian formula or decision tree). A sketch of the automatic strategies follows below.
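A minimal sketch of the automatic fill-in strategies using pandas; the column names and values are illustrative:

```python
# Three automatic strategies for filling missing values.
import pandas as pd

df = pd.DataFrame({
    'class':  ['yes', 'yes', 'no', 'no', 'yes'],
    'income': [50_000, None, 42_000, None, 61_000],
})

# 1. A global constant (effectively a new "unknown" category).
by_constant = df['income'].fillna(-1)

# 2. The attribute mean over all samples.
by_mean = df['income'].fillna(df['income'].mean())

# 3. The attribute mean per class (smarter: conditions on the class label).
by_class_mean = df.groupby('class')['income'] \
                  .transform(lambda s: s.fillna(s.mean()))
print(by_class_mean)
```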
Noisy Data Noise: random error or variance in a measured variable Incorrect attribute values may due to faulty data collection instruments data entry problems data transmission problems technology limitation inconsistency in naming convention  Other data problems which requires data cleaning duplicate records incomplete data inconsistent data
How to Handle Noisy Data? Binning method: first sort the data and partition it into (equi-depth) bins; then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc. (see the sketch below). Clustering: detect and remove outliers. Combined computer and human inspection: detect suspicious values and have a human check them (e.g., deal with possible outliers). Regression: smooth by fitting the data to regression functions.
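A minimal sketch of smoothing by bin means over equi-depth bins; the data values are illustrative:

```python
# Sort, partition into equi-depth bins, then replace each value by its bin mean.
data = sorted([21, 4, 24, 8, 28, 15, 21, 34, 25])  # 4, 8, 15, 21, 21, 24, 25, 28, 34
depth = 3                                           # three values per bin

smoothed = []
for i in range(0, len(data), depth):
    bin_vals = data[i:i + depth]
    mean = sum(bin_vals) / len(bin_vals)
    smoothed += [mean] * len(bin_vals)

print(smoothed)   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```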
Simple Discretization Methods: Binning. Equal-width (distance) partitioning: divides the range into N intervals of equal size (a uniform grid); if A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B − A) / N. The most straightforward approach, but outliers may dominate the presentation, and skewed data is not handled well. Equal-depth (frequency) partitioning: divides the range into N intervals, each containing approximately the same number of samples; gives good data scaling, but managing categorical attributes can be tricky. A sketch contrasting the two follows below.
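A minimal sketch contrasting the two partitionings with pandas (the bin count and data are illustrative):

```python
# Equal-width vs. equal-depth binning of the same attribute.
import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34, 98])   # 98 is an outlier

# Equal-width: N intervals of width W = (B - A) / N; the outlier stretches
# the range, so most values crowd into the first bin.
equal_width = pd.cut(values, bins=3)

# Equal-depth: N intervals each holding roughly the same number of samples.
equal_depth = pd.qcut(values, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```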
Thank you!
Michael R. Wick, Professor and Chair
Department of Computer Science, University of Wisconsin – Eau Claire, Eau Claire, WI 54701
[email_address] | 715-836-2526

