Data Mining Concepts. Ho Viet Lam, Nguyen Thi My Dung. May 14th, 2007
Content Introduction Overview of data mining technology Association rules Classification Clustering Applications of data mining Commercial tools Conclusion
Introduction What is data mining? Why do we need to ‘mine’ data? On what kind of data can we ‘mine’?
What is data mining? The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. A part of the Knowledge Discovery in Data (KDD) process.
Why data mining? The explosive growth in data collection. The storing of data in data warehouses. The availability of increased access to data from Web navigation and intranets. We have to find a more effective way to use this data in the decision support process than just traditional query languages.
On what kind of data? Relational databases. Data warehouses. Transactional databases. Advanced database systems: object-relational, spatial and temporal, time-series, multimedia and text, WWW, ... Structure - 3D anatomy; Function - 1D signal; Metadata - annotation.
Overview of data mining technology Data Mining vs. Data Warehousing Data Mining as a part of Knowledge Discovery Process Goals of Data Mining and Knowledge Discovery Types of Knowledge Discovery during Data Mining
Data Mining vs. Data Warehousing. A data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated. This makes it much easier and more efficient to run queries over data that originally came from different sources. The goal of data warehousing is to support decision making with data!
Knowledge Discovery in Databases and Data Mining. The non-trivial extraction of implicit, unknown, and potentially useful information from databases. The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.
Goals of Data Mining and KDD. Prediction: how certain attributes within the data will behave in the future. Identification: identify the existence of an item, an event, an activity. Classification: partition the data into categories. Optimization: optimize the use of limited resources.
Types of Knowledge Discovery during Data Mining. Association rules. Classification hierarchies. Sequential patterns. Patterns within time-series. Clustering.
Content Introduction Overview of data mining technology Association rules Classification Clustering Applications of data mining Commercial tools Conclusion
Association Rules. Purpose: providing rules that correlate the presence of one set of items with another set of items. Example: a rule such as Bread => Juice ("customers who buy bread also tend to buy juice").
Association Rules. Some concepts. Market-basket model: look for combinations of products bought together, e.g., put the SHOES near the SOCKS so that a customer who buys one will also buy the other. Transaction: the set of items a person buys together in one visit to the supermarket.
Association Rules. Some concepts.
Support: how frequently a specific itemset occurs in the database, as a fraction of all transactions.
Confidence of the rule X => Y: support(X U Y) / support(X). Example: in the database below, the confidence of Bread => Juice is 50%, since juice appears in one of the two transactions that contain bread.
The goal of mining association rules is to generate all rules that exceed some minimum user-specified support and confidence thresholds.

Sample transaction database:
Transaction-id   Time   Items-Bought
101              6:35   Bread, Milk, Cookies, Juice
792              7:38   Milk, Juice
1130             8:05   Milk, Eggs
1735             8:40   Bread, Cookies, Coffee

Item      Support (count)
Milk      3
Bread     2
Cookies   2
Juice     2
Eggs      1
Coffee    1
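As a minimal sketch (Python is assumed here; it is not part of the original slides), support and confidence for this sample database can be computed directly:

    # Sample transactions from the table above
    transactions = [
        {"bread", "milk", "cookies", "juice"},   # 101
        {"milk", "juice"},                       # 792
        {"milk", "eggs"},                        # 1130
        {"bread", "cookies", "coffee"},          # 1735
    ]

    def support(itemset, transactions):
        """Fraction of transactions that contain every item in itemset."""
        itemset = set(itemset)
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(lhs, rhs, transactions):
        """Confidence of the rule lhs => rhs: support(lhs U rhs) / support(lhs)."""
        return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

    print(support({"milk"}, transactions))                 # 0.75
    print(support({"bread", "juice"}, transactions))       # 0.25
    print(confidence({"bread"}, {"juice"}, transactions))  # 0.5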
Association Rules. Apriori Algorithm. Input: a database of m transactions, D, and a minimum support, mins, represented as a fraction of m. Output: frequent itemsets L1, L2, ..., Lk.
Association Rules. Apriori algorithm on database D, with minimum support count mins = 2, i.e., minimum frequency minf = 0.5 (keep itemsets with frequency >= 0.5):

Transaction-id   Time   Items-Bought
101              6:35   Bread, Milk, Cookies, Juice
792              7:38   Milk, Juice
1130             8:05   Milk, Eggs
1735             8:40   Bread, Cookies, Coffee

Item supports: Milk 3, Bread 2, Cookies 2, Juice 2, Coffee 1, Eggs 1.

Candidate 1-itemsets: milk, bread, juice, cookies, eggs, coffee (supports 0.75, 0.5, 0.5, 0.5, 0.25, 0.25).
Frequent 1-itemsets: milk, bread, juice, cookies (supports 0.75, 0.5, 0.5, 0.5).
Candidate 2-itemsets: {milk, bread}, {milk, juice}, {bread, juice}, {milk, cookies}, {bread, cookies}, {juice, cookies} (supports 0.25, 0.5, 0.25, 0.25, 0.5, 0.25).
Frequent 2-itemsets: {milk, juice}, {bread, cookies} (supports 0.5, 0.5).
Candidate 3-itemsets: none ({milk, juice} and {bread, cookies} share no item, so they cannot be joined); the frequent 3-itemsets are empty.
Result: milk, bread, juice, cookies, {milk, juice}, {bread, cookies}.
Apriori Algorithm
Begin
  compute support(ij) = count(ij)/m for each individual item i1, i2, ..., in by scanning the database once and counting the number of transactions that item ij appears in;
  the candidate frequent 1-itemsets, C1, will be the set of items i1, i2, ..., in;
  the subset of items from C1 where support(ij) >= mins becomes the frequent 1-itemsets, L1;
  k = 1;
  termination = false;
  repeat
    Lk+1 = empty;
    create the candidate frequent (k+1)-itemsets, Ck+1, by combining members of Lk that have k-1 items in common (this forms candidate frequent (k+1)-itemsets by selectively extending frequent k-itemsets by one item);
    in addition, only consider as elements of Ck+1 those (k+1)-itemsets such that every subset of size k appears in Lk;
    scan the database once and compute the support for each member of Ck+1; if the support for a member of Ck+1 >= mins then add that member to Lk+1;
    if Lk+1 is empty then termination = true
    else k = k + 1;
  until termination;
End
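A compact, runnable Python sketch of the same loop (an illustration, not the textbook's code; the set-union join below is one of several equivalent candidate-generation strategies):

    from itertools import combinations

    def apriori(transactions, mins):
        """Return all frequent itemsets (as frozensets) with support >= mins."""
        m = len(transactions)
        transactions = [frozenset(t) for t in transactions]

        def support(itemset):
            return sum(1 for t in transactions if itemset <= t) / m

        items = {i for t in transactions for i in t}
        # Frequent 1-itemsets (L1)
        Lk = {frozenset([i]) for i in items if support(frozenset([i])) >= mins}
        frequent = list(Lk)
        k = 1
        while Lk:
            # Join step: union pairs of frequent k-itemsets sharing k-1 items
            candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
            # Prune step: every k-subset of a candidate must itself be frequent
            candidates = {c for c in candidates
                          if all(frozenset(s) in Lk for s in combinations(c, k))}
            Lk = {c for c in candidates if support(c) >= mins}
            frequent.extend(Lk)
            k += 1
        return frequent

    transactions = [
        {"bread", "milk", "cookies", "juice"},
        {"milk", "juice"},
        {"milk", "eggs"},
        {"bread", "cookies", "coffee"},
    ]
    for itemset in apriori(transactions, mins=0.5):
        print(set(itemset))
    # milk, bread, juice, cookies, {milk, juice}, {bread, cookies}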
Association Rules Demo of Apriori Algorithm
Association Rules. Frequent-pattern tree algorithm. Motivated by the fact that Apriori-based algorithms may generate and test a very large number of candidate itemsets. Example: with 1000 frequent 1-itemsets, Apriori would have to generate and test 1000 x 999 / 2 (about 500,000) candidate 2-itemsets, and in the worst case the number of candidates of all sizes grows exponentially (there are 2^1000 possible itemsets over 1000 items). The FP-growth algorithm is one approach that eliminates the generation of a large number of candidate itemsets.
Association Rules Frequent-pattern tree algorithm Generating a compressed version of the database in terms of an FP-Tree FP-Growth Algorithm for finding frequent itemsets
Association Rules. FP-Tree algorithm. Building the FP-tree: infrequent items (eggs, coffee) are dropped, the remaining items in each transaction are reordered by descending support, and each transaction is inserted into a tree under Root; an item head table (Milk: support 3, Bread: 2, Cookies: 2, Juice: 2) keeps a link to the nodes for each item.
Transaction 1: Bread, Milk, Cookies, Juice, ordered as Milk, Bread, Cookies, Juice, creates the path Milk:1 - Bread:1 - Cookies:1 - Juice:1.
Transaction 2: Milk, Juice shares the Milk prefix, giving Milk:2 with a new child Juice:1.
Transaction 3: Milk, Eggs reduces to Milk, giving Milk:3.
Transaction 4: Bread, Cookies, Coffee reduces to Bread, Cookies; it shares no prefix, so a new path Bread:1 - Cookies:1 is added under Root.
Association Rules. FP-Growth algorithm. Mining the tree built above (Root with the branch Milk:3 - Bread:1 - Cookies:1 - Juice:1, a second child Juice:1 under Milk, and a separate branch Bread:1 - Cookies:1): the header items are processed bottom-up (juice, cookies, bread, milk), building each item's conditional pattern base, which yields the frequent patterns {Milk, Juice} and {Bread, Cookies}.
Association Rules. FP-Growth algorithm.
Procedure FP-growth (tree, alpha);
Begin
  If tree contains a single path then
    For each combination, beta, of the nodes in the path
      Generate pattern (beta U alpha) with support = minimum support of the nodes in beta
  Else
    For each item, i, in the header of the tree do
    Begin
      Generate pattern beta = (i U alpha) with support = i.support;
      Construct beta's conditional pattern base;
      Construct beta's conditional FP-tree, beta_tree;
      If beta_tree is not empty then FP-growth (beta_tree, beta);
    End
End
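For larger databases one would use a library implementation; as an illustration (assuming the third-party mlxtend package, which the slides do not mention), FP-growth over the sample database looks like this:

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import fpgrowth

    transactions = [
        ["bread", "milk", "cookies", "juice"],
        ["milk", "juice"],
        ["milk", "eggs"],
        ["bread", "cookies", "coffee"],
    ]

    # One-hot encode the transactions into a boolean DataFrame
    te = TransactionEncoder()
    df = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

    # Frequent itemsets with support >= 0.5, mined without candidate generation
    print(fpgrowth(df, min_support=0.5, use_colnames=True))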
Association Rules Demo of FP-Growth algorithm
Classification. Introduction. Classification is the process of learning a model that describes different classes of data; the classes are predetermined. The model produced is usually in the form of a decision tree or a set of rules. Example decision tree for credit risk:
Married?
  yes: Salary? <20k: poor risk; >=20k and <50k: fair risk; >=50k: good risk.
  no: Acct balance? <5k: poor risk; >=5k: Age? <25: fair risk; >=25: good risk.
Choosing the class attribute by information gain. Expected information for the whole training set (3 "yes", 3 "no"): I(3,3) = 1. Information gain: Gain(A) = I - E(A), where E(A) is the entropy-weighted expected information after splitting on A:
E(Married) = 0.92, Gain(Married) = 0.08
E(Salary) = 0.33, Gain(Salary) = 0.67
E(A.balance) = 0.92, Gain(A.balance) = 0.08
E(Age) = 0.54, Gain(Age) = 0.46
Salary has the highest gain, so it becomes the root: <20k, class is "no" {4,5}; >=50k, class is "yes" {1,2}; 20k..50k splits again on Age: <25, class is "no" {3}; >=25, class is "yes" {6}.

Training data:
RID  Married  Salary    Acct balance  Age   Loanworthy
1    No       >=50k     <5k           >=25  Yes
2    Yes      >=50k     >=5k          >=25  Yes
3    Yes      20k..50k  <5k           <25   No
4    No       <20k      >=5k          <25   No
5    No       <20k      <5k           >=25  No
6    Yes      20k..50k  >=5k          >=25  Yes
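A small Python sketch (not part of the original slides) that reproduces these gains from the training table:

    import math

    # Training records: (married, salary, acct_balance, age, loanworthy)
    records = [
        ("no",  ">=50k",    "<5k",  ">=25", "yes"),  # RID 1
        ("yes", ">=50k",    ">=5k", ">=25", "yes"),  # RID 2
        ("yes", "20k..50k", "<5k",  "<25",  "no"),   # RID 3
        ("no",  "<20k",     ">=5k", "<25",  "no"),   # RID 4
        ("no",  "<20k",     "<5k",  ">=25", "no"),   # RID 5
        ("yes", "20k..50k", ">=5k", ">=25", "yes"),  # RID 6
    ]

    def entropy(labels):
        """I(p, n) for a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in (labels.count(v) for v in set(labels)))

    def gain(attr_index, records):
        """Information gain of splitting on the attribute at attr_index."""
        total = entropy([r[-1] for r in records])
        parts = {}
        for r in records:
            parts.setdefault(r[attr_index], []).append(r[-1])
        expected = sum(len(p) / len(records) * entropy(p)
                       for p in parts.values())
        return total - expected

    for i, name in enumerate(["Married", "Salary", "Acct balance", "Age"]):
        print(f"Gain({name}) = {gain(i, records):.2f}")
    # Salary has the highest gain (0.67), so it is chosen as the root attribute.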
Classification. Algorithm for decision tree induction.
Procedure Build_tree (Records, Attributes);
Begin
  Create a node N;
  If all records belong to the same class, C, then
    Return N as a leaf node with class label C;
  If Attributes is empty then
    Return N as a leaf node with class label C, such that the majority of records belong to it;
  Select the attribute Ai with the highest information gain from Attributes;
  Label node N with Ai;
  For each known value, Vj, of Ai do
  Begin
    Add a branch from node N for the condition Ai = Vj;
    Sj = subset of Records where Ai = Vj;
    If Sj is empty then
      Add a leaf, L, with class label C, such that the majority of records belong to it, and return L
    Else add the node returned by Build_tree(Sj, Attributes - Ai);
  End;
End;
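For comparison, an off-the-shelf learner can induce a similar tree. This sketch assumes scikit-learn (not mentioned in the slides) and its entropy criterion; note that scikit-learn builds binary splits, so the tree shape differs from the multiway tree above:

    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Same training table as above, encoded ordinally for the learner
    X_raw = [["no",  ">=50k",    "<5k",  ">=25"],
             ["yes", ">=50k",    ">=5k", ">=25"],
             ["yes", "20k..50k", "<5k",  "<25"],
             ["no",  "<20k",     ">=5k", "<25"],
             ["no",  "<20k",     "<5k",  ">=25"],
             ["yes", "20k..50k", ">=5k", ">=25"]]
    y = ["yes", "yes", "no", "no", "no", "yes"]

    enc = OrdinalEncoder()
    X = enc.fit_transform(X_raw)

    tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
    print(export_text(tree,
          feature_names=["married", "salary", "acct_balance", "age"]))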
Classification Demo of decision tree
Clustering. Introduction. The previous data mining task, classification, deals with partitioning data based on a pre-classified training sample. Clustering is an automated process that groups related records together on the basis of their having similar values for their attributes. The groups are usually disjoint.
Clustering. Some concepts. An important facet of clustering is the similarity function that is used. When the data is numeric, a similarity function based on distance is typically used: the Euclidean metric (Euclidean distance), the Minkowski metric, or the Manhattan metric.
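For concreteness (a sketch in Python, not part of the original slides), the three distance metrics for two numeric records x and y:

    import math

    def euclidean(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def manhattan(x, y):
        return sum(abs(a - b) for a, b in zip(x, y))

    def minkowski(x, y, p):
        # p = 1 gives Manhattan distance, p = 2 gives Euclidean distance
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

    print(euclidean((0, 0), (3, 4)))     # 5.0
    print(manhattan((0, 0), (3, 4)))     # 7
    print(minkowski((0, 0), (3, 4), 2))  # 5.0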
Clustering. K-means clustering algorithm.
Input: a database D of m records r1, ..., rm, and a desired number of clusters, k.
Output: a set of k clusters.
Begin
  Randomly choose k records as the centroids for the k clusters;
  Repeat
    Assign each record, ri, to the cluster whose centroid (mean) is closest to ri among the k clusters;
    Recalculate the centroid (mean) of each cluster from the records assigned to it;
  Until no change;
End;
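A minimal NumPy sketch of the same loop (illustrative only; it assumes no cluster ever becomes empty, and a production system would use a library implementation):

    import numpy as np

    def kmeans(X, k, seed=0):
        """Lloyd's algorithm: X is an (m, d) array; returns (centroids, labels)."""
        rng = np.random.default_rng(seed)
        # Randomly choose k records as the initial centroids
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        while True:
            # Assign each record to the nearest centroid (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each centroid as the mean of its assigned records
            new_centroids = np.array([X[labels == j].mean(axis=0)
                                      for j in range(k)])
            if np.allclose(new_centroids, centroids):  # no change: converged
                return centroids, labels
            centroids = new_centroids

    X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
    centroids, labels = kmeans(X, k=2)
    print(labels)  # two well-separated clusters, e.g. [0 0 1 1]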
Clustering Demo of K-means algorithm
Content Introduction Overview of data mining technology Association rules Classification Clustering Applications of data mining Commercial tools Conclusion
Applications of data mining. Market analysis: marketing strategies, advertisement. Risk analysis and management: finance and financial investments, manufacturing and production. Fraud detection and detection of unusual patterns (outliers): telecommunications, financial transactions, anti-terrorism (!!!).
Applications of data mining. Text mining (newsgroups, email, documents) and Web mining. Stream data mining. DNA and bio-data analysis: disease outcomes, effectiveness of treatments, identifying new drugs.
Commercial tools Oracle Data Miner http://www.oracle.com/technology/products/bi/odm/odminer.html Data To Knowledge   http://alg.ncsa.uiuc.edu/do/tools/d2k SAS  http://www.sas.com/ Clementine http://spss.com/clemetine/ Intelligent Miner   http://www-306.ibm.com/software/data/iminer/
Conclusion Data mining is a “decision support” process in which we search for patterns of information in data. This technique can be used on many types of data.  Overlaps with machine learning, statistics, artificial intelligence, databases, visualization…
Conclusion The result of mining may be to discover the following types of "new" information: Association rules Sequential patterns Classification trees …
References Fundamentals of Database Systems, fourth edition -- R. Elmasri, S. B. Navathe -- Addison-Wesley -- ISBN 0-321-20448-4 Discovering Knowledge in Data: An Introduction to Data Mining -- Daniel T. Larose -- Wiley -- ISBN 0-471-66652-2 Resources from the Internet Thanks for listening!!!
