Process Mining - Chapter 3 - Data Mining

Slides supporting the book "Process Mining: Discovery, Conformance, and Enhancement of Business Processes" by Wil van der Aalst. See also http://springer.com/978-3-642-19344-6 (ISBN 978-3-642-19344-6) and the website http://www.processmining.org/book/start, which provides sample logs.

  1. Chapter 3: Data Mining
     prof.dr.ir. Wil van der Aalst, www.processmining.org
  2. Overview
     Chapter 1: Introduction
     Part I: Preliminaries
       Chapter 2: Process Modeling and Analysis
       Chapter 3: Data Mining
     Part II: From Event Logs to Process Models
       Chapter 4: Getting the Data
       Chapter 5: Process Discovery: An Introduction
       Chapter 6: Advanced Process Discovery Techniques
     Part III: Beyond Process Discovery
       Chapter 7: Conformance Checking
       Chapter 8: Mining Additional Perspectives
       Chapter 9: Operational Support
     Part IV: Putting Process Mining to Work
       Chapter 10: Tool Support
       Chapter 11: Analyzing “Lasagna Processes”
       Chapter 12: Analyzing “Spaghetti Processes”
     Part V: Reflection
       Chapter 13: Cartography and Navigation
       Chapter 14: Epilogue
  3. Data mining
     • The growth of the “digital universe” is the main driver for the popularity of data mining.
     • Initially, the term “data mining” had a negative connotation (“data snooping”, “fishing”, and “data dredging”).
     • Now a mature discipline.
     • Data-centric, not process-centric.
  4. Data set 1: data about 860 recently deceased persons to study the effects of drinking, smoking, and body weight on life expectancy.
     Questions:
     - What is the effect of smoking and drinking on a person’s body weight?
     - Do people that smoke also drink?
     - What factors influence a person’s life expectancy the most?
     - Can one identify groups of people having a similar lifestyle?
  5. Data set 2: data about 420 students to investigate relationships among course grades and the student’s overall performance in the Bachelor program.
     Questions:
     - Are the marks of certain courses highly correlated?
     - Which electives do excellent students (cum laude) take?
     - Which courses significantly delay the moment of graduation?
     - Why do students drop out?
     - Can one identify groups of students having a similar study behavior?
  6. Data set 3: data on 240 customer orders in a coffee bar recorded by the cash register.
     Questions:
     - Which products are frequently purchased together?
     - When do people buy a particular product?
     - Is it possible to characterize typical customer groups?
     - How to promote the sales of products with a higher margin?
  7. Variables
     • A data set (sample or table) consists of instances (individuals, entities, cases, objects, or records).
     • Variables are often referred to as attributes, features, or data elements.
     • Two types:
       − categorical variables:
         − ordinal (high-med-low, cum laude-passed-failed) or
         − nominal (true-false, red-pink-green)
       − numerical variables (ordered, cannot be enumerated easily)
  8. Supervised Learning
     • Labeled data, i.e., there is a response variable that labels each instance.
     • Goal: explain the response variable (dependent variable) in terms of predictor variables (independent variables).
     • Classification techniques (e.g., decision tree learning) assume a categorical response variable; the goal is to classify instances based on the predictor variables.
     • Regression techniques assume a numerical response variable; the goal is to find a function that fits the data with the least error.
  9. Unsupervised Learning
     • Unsupervised learning assumes unlabeled data, i.e., the variables are not split into response and predictor variables.
     • Examples: clustering (e.g., k-means clustering and agglomerative hierarchical clustering) and pattern discovery (association rules).
  10. Decision tree learning: data set 1
      [Figure: decision tree splitting on the attributes smoker, drinker, and weight (<90 / ≥90); leaves are labeled young or old with (instances/errors) counts such as (195/11), (65/2), (219/34), and (381/55).]
  11. Decision tree learning: data set 2
      [Figure: decision tree splitting on course grades (logic, programming, linear algebra, operations research) with thresholds such as ≥8/<8, ≥7/<7, and ≥6/<6; leaves are labeled failed, passed, or cum laude with (instances/errors) counts.]
  12. Decision tree learning: data set 3
      [Figure: decision tree splitting on the number of teas, lattes, and espressos ordered; leaves predict muffin or no muffin with (instances/errors) counts such as (189/10), (30/1), (4/0), (6/2), and (11/3).]
  13. Basic idea
      • Split the set of instances into subsets such that the variation within each subset becomes smaller.
      • Based on the notion of entropy or a similar measure.
      • Minimize average entropy; maximize information gain per step.
      [Figure: for data set 1 (546 young, 314 old; overall entropy E = 0.946848), splitting on attribute smoker reduces the weighted average entropy to 0.839836 (information gain 0.107012); a further split on attribute drinker reduces it to 0.763368 (information gain 0.076468).]
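To make the entropy-based splitting criterion concrete, here is a minimal sketch (not the full decision tree learner) that computes the entropy of a class distribution and the information gain of a candidate split, using the counts shown on the slide. The function names are illustrative choices.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Reduction in weighted average entropy achieved by a candidate split."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted

# Counts from the slide: 546 young / 314 old overall;
# splitting on "smoker" gives (184 young, 11 old) vs. (362 young, 303 old).
print(entropy([546, 314]))                                    # overall entropy, ~0.947
print(information_gain([546, 314], [[184, 11], [362, 303]]))  # ~0.107, as on the slide
```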
  14. Clustering
      [Figure: instances plotted by age and weight, grouped into cluster A, cluster B, and cluster C, each marked with a “+”.]
  15. k-means clustering
      [Figure: three stages (a), (b), (c) of k-means clustering, showing how the cluster centers (“+”) and the assignment of points evolve.]
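The (a)-(b)-(c) stages above reflect the iterative assign-and-update idea behind k-means. Below is a minimal sketch for 2-D points such as the (age, weight) instances of the clustering example; the function name and the random initialization are illustrative choices, not prescribed by the slides.

```python
import random

def k_means(points, k, iterations=100, seed=0):
    """Minimal k-means (Lloyd's algorithm) for 2-D points given as (x, y) tuples."""
    random.seed(seed)
    centers = random.sample(points, k)              # initial centers: k random instances
    for _ in range(iterations):
        # assignment step: attach each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # update step: move each center to the mean of its cluster
        new_centers = []
        for c, cl in zip(centers, clusters):
            if cl:
                new_centers.append((sum(p[0] for p in cl) / len(cl),
                                    sum(p[1] for p in cl) / len(cl)))
            else:
                new_centers.append(c)               # keep the center of an empty cluster
        if new_centers == centers:                  # centers stable: converged
            break
        centers = new_centers
    return centers, clusters
```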
  16. Agglomerative hierarchical clustering
      [Figure: (a) instances a–j grouped into nested clusters (e.g., ab, cd, fg, hi, efg, hij, abcd, efghij); (b) the corresponding dendrogram with leaves a–j and root abcdefghij.]
  17. Levels introduced by agglomerative hierarchical clustering
      [Figure: the same nested clusters and dendrogram as on the previous slide.]
      Any horizontal line in the dendrogram corresponds to a concrete clustering at a particular level of abstraction.
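A minimal sketch of the bottom-up merging that produces such a dendrogram: start with one cluster per instance and repeatedly merge the two closest clusters, recording each merge and its distance. Single linkage and the function names are assumptions for the illustration; any distance function and linkage criterion could be plugged in. Cutting the recorded merge sequence at a given distance corresponds to drawing a horizontal line in the dendrogram.

```python
def agglomerative(points, dist):
    """Single-linkage agglomerative clustering: record every merge and its distance."""
    clusters = [[i] for i in range(len(points))]      # start with one cluster per instance
    merges = []                                       # the dendrogram as a merge sequence
    while len(clusters) > 1:
        # find the pair of clusters with the smallest single-link distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))  # these two clusters merge at distance d
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

# Example usage for 1-D points (hypothetical data):
# merges = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], lambda p, q: abs(p - q))
```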
  18. Association rule learning
      • Rules of the form “IF X THEN Y”.
  19. Special case: market basket analysis
  20. Example (people that order tea and latte also order muffins)
      • Support should be as high as possible (but will be low in case of many items).
      • Confidence should be close to 1.
      • High lift values suggest a positive correlation (1 if independent).
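The three measures can be computed directly from a collection of baskets. The sketch below assumes each transaction is a Python set of items; the example baskets are made up in the spirit of data set 3 and are not the actual 240 recorded orders.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift of the rule 'IF antecedent THEN consequent'."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_x  = sum(1 for t in transactions if antecedent <= t)                 # baskets with X
    n_y  = sum(1 for t in transactions if consequent <= t)                 # baskets with Y
    n_xy = sum(1 for t in transactions if (antecedent | consequent) <= t)  # baskets with X and Y
    support    = n_xy / n
    confidence = n_xy / n_x if n_x else 0.0
    lift       = (n_xy * n) / (n_x * n_y) if n_x and n_y else 0.0
    return support, confidence, lift

# Hypothetical coffee-bar baskets:
baskets = [{"tea", "latte", "muffin"}, {"tea", "latte", "muffin"},
           {"espresso"}, {"tea", "muffin"}, {"latte"}]
print(rule_metrics(baskets, {"tea", "latte"}, {"muffin"}))  # (0.4, 1.0, 1.666...)
```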
  21. Brute force algorithm
  22. Apriori (optimization based on two observations)
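The two observations behind Apriori are that every subset of a frequent itemset is itself frequent and, conversely, that no superset of an infrequent itemset can be frequent. The sketch below exploits this by generating candidates of size k+1 only from frequent k-itemsets and pruning any candidate with an infrequent subset. It is a compact illustration rather than an optimized implementation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining with Apriori candidate pruning.
    Returns a dict mapping each frequent itemset (frozenset) to its support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    level = [frozenset([i]) for i in items]            # candidate 1-itemsets
    k = 1
    while level:
        # count candidates and keep those meeting the support threshold
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(current)
        # candidate generation: unions of frequent k-itemsets giving (k+1)-itemsets
        # whose k-subsets are all frequent (the pruning step)
        keys = list(current)
        candidates = {a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1}
        level = [c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k))]
        k += 1
    return frequent
```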
  23. Sequence mining
  24. Episode mining (32 time windows of length 5)
      [Figure: an event sequence over time points 10–37 (events a, b, c, d, e, f) and three candidate episodes E1, E2, and E3 composed of events a, b, c, and d.]
  25. Occurrences
      [Figure: the same event sequence with the occurrences of episodes E1, E2, and E3 marked; episode E2 occurs 16 times.]
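One way to quantify episode frequency, in the spirit of the windows on these two slides, is to count the sliding time windows in which all events of an episode occur. The sketch below handles only parallel episodes (an unordered set of event types) and uses one common window definition (every window that overlaps the sequence); the ordering constraints of episodes such as E1-E3 are not checked here, so this is an assumption-laden simplification.

```python
def episode_frequency(events, episode, window_length):
    """Count sliding time windows of the given length in which every event type
    of a (parallel) episode occurs at least once. `events` is a list of (time, event)."""
    times = [t for t, _ in events]
    start, end = min(times), max(times)
    # every window [w, w + window_length) that overlaps the event sequence;
    # for a sequence spanning times 10..37 and length 5 this gives 32 windows
    window_starts = range(start - window_length + 1, end + 1)
    hits = 0
    for w in window_starts:
        observed = {e for t, e in events if w <= t < w + window_length}
        if set(episode) <= observed:
            hits += 1
    return hits, len(window_starts)
```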
  26. Hidden Markov models
      • Given an observation sequence, how to compute the probability of the sequence given a hidden Markov model?
      • Given an observation sequence and a hidden Markov model, how to compute the most likely “hidden path” in the model?
      • Given a set of observation sequences, how to derive the hidden Markov model that maximizes the probability of producing these sequences?
      [Figure: an example hidden Markov model with hidden states s1, s2, s3 connected by transition probabilities, emitting observations a–e with observation probabilities.]
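For the first of the three questions, the probability of an observation sequence can be computed with the standard forward algorithm; a minimal sketch is given below. The two-state toy parameters in the usage example are hypothetical and are not the three-state model drawn on the slide.

```python
def forward_probability(observations, states, start_p, trans_p, emit_p):
    """Forward algorithm: probability of an observation sequence given an HMM."""
    # alpha[s] = probability of the observations seen so far and ending in state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][obs]
                 for s in states}
    return sum(alpha.values())

# Hypothetical two-state model over observations "a" and "b":
states  = ["s1", "s2"]
start_p = {"s1": 1.0, "s2": 0.0}
trans_p = {"s1": {"s1": 0.3, "s2": 0.7}, "s2": {"s1": 0.5, "s2": 0.5}}
emit_p  = {"s1": {"a": 0.8, "b": 0.2}, "s2": {"a": 0.4, "b": 0.6}}
print(forward_probability(["a", "b", "a"], states, start_p, trans_p, emit_p))
```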
  27. Relation between data mining and process mining
      • Process mining: about end-to-end processes.
      • Data mining: data-centric and not process-centric.
      • Judging the quality of data mining and process mining results: many similarities, but also some differences.
      • Clearly, process mining techniques can benefit from experiences in the data mining field.
      • Let us now focus on the quality of mining results.
  28. Confusion matrix
      [Figure: the decision tree for data set 2, as on slide 11.]
                             predicted class
                             failed   passed   cum laude
      actual   failed           178       22           0
      class    passed            21      175           2
               cum laude          1        3          18
  29. Confusion matrix: metrics
      [Figure: a two-class confusion matrix with predicted class (+/−) as columns and actual class (+/−) as rows; the cells contain tp, fn, fp, tn, with row sums p and n, column sums p’ and n’, and total N.]
      name       formula
      error      (fp+fn)/N
      accuracy   (tp+tn)/N
      tp-rate    tp/p
      fp-rate    fp/n
      precision  tp/p’
      recall     tp/p
      tp is the number of true positives, i.e., instances that are correctly classified as positive.
      fn is the number of false negatives, i.e., instances that are predicted to be negative but should have been classified as positive.
      fp is the number of false positives, i.e., instances that are predicted to be positive but should have been classified as negative.
      tn is the number of true negatives, i.e., instances that are correctly classified as negative.
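The formulas above translate directly into code. A minimal sketch, assuming a two-class problem with the positive class fixed in advance:

```python
def classification_metrics(tp, fn, fp, tn):
    """Metrics derived from a two-class confusion matrix (formulas as on the slide)."""
    p, n   = tp + fn, fp + tn     # actual positives and actual negatives
    p_pred = tp + fp              # instances predicted positive (p' on the slide)
    total  = p + n                # N
    return {
        "error":     (fp + fn) / total,
        "accuracy":  (tp + tn) / total,
        "tp-rate":   tp / p,
        "fp-rate":   fp / n,
        "precision": tp / p_pred,
        "recall":    tp / p,      # identical to the tp-rate
    }
```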
  30. Example
      [Figure: two of the decision trees for data set 1 (cf. slide 13) together with their confusion matrices.]
      (a)                  predicted class
                           young     old
      actual   young         546       0
      class    old           314       0

      (b)                  predicted class
                           young     old
      actual   young         544       2
      class    old           251      63
  31. Cross-validation
      [Figure: the data set is split into a training set and a test set; the learning algorithm builds a model from the training set, and testing the model on the test set yields a performance indicator.]
  32. k-fold cross-validation
      [Figure: the data set is split into k parts; the learning algorithm is applied k times, each time testing on a different part and training on the remaining parts (the folds rotate), yielding k performance indicators.]
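A minimal sketch of the rotation described above. The callable `train_and_test` is a placeholder for any learning algorithm that builds a model on the training set and returns a performance indicator measured on the test set; it is not a fixed API from the slides.

```python
def k_fold_cross_validation(instances, k, train_and_test):
    """k-fold cross-validation: every instance is used exactly once for testing.
    `train_and_test(training_set, test_set)` returns a performance indicator
    (e.g. accuracy); the instances are assumed to be shuffled beforehand."""
    folds = [instances[i::k] for i in range(k)]                 # k roughly equal folds
    scores = []
    for i in range(k):
        test_set = folds[i]                                     # fold i is held out
        training_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_test(training_set, test_set))
    return sum(scores) / k                                      # average over the k runs
```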
  33. Occam’s Razor
      • Principle attributed to the 14th-century English logician William of Ockham.
      • The principle states that “one should not increase, beyond what is necessary, the number of entities required to explain anything”, i.e., one should look for the “simplest model” that can explain what is observed in the data set.
      • The Minimal Description Length (MDL) principle tries to operationalize Occam’s Razor. In MDL, performance is judged on the training data alone and not measured against new, unseen instances. The basic idea is that the “best” model is the one that minimizes the encoding of both model and data set.
