
- 1. Chapter 3: Data Mining. prof.dr.ir. Wil van der Aalst, www.processmining.org
- 2. Overview
  - Part I: Preliminaries. Chapter 1: Introduction; Chapter 2: Process Modeling and Analysis; Chapter 3: Data Mining.
  - Part II: From Event Logs to Process Models. Chapter 4: Getting the Data; Chapter 5: Process Discovery: An Introduction; Chapter 6: Advanced Process Discovery Techniques.
  - Part III: Beyond Process Discovery. Chapter 7: Conformance Checking; Chapter 8: Mining Additional Perspectives; Chapter 9: Operational Support.
  - Part IV: Putting Process Mining to Work. Chapter 10: Tool Support; Chapter 11: Analyzing “Lasagna Processes”; Chapter 12: Analyzing “Spaghetti Processes”.
  - Part V: Reflection. Chapter 13: Cartography and Navigation; Chapter 14: Epilogue.
- 3. Data mining
  - The growth of the “digital universe” is the main driver for the popularity of data mining.
  - Initially, the term “data mining” had a negative connotation (“data snooping”, “fishing”, “data dredging”).
  - It is now a mature discipline.
  - Data mining is data-centric, not process-centric.
- 4. Data set 1: data about 860 recently deceased persons, used to study the effects of drinking, smoking, and body weight on life expectancy. Questions:
  - What is the effect of smoking and drinking on a person’s body weight?
  - Do people that smoke also drink?
  - What factors influence a person’s life expectancy the most?
  - Can one identify groups of people having a similar lifestyle?
- 5. Data set 2: data about 420 students, used to investigate relationships among course grades and the student’s overall performance in the Bachelor program. Questions:
  - Are the marks of certain courses highly correlated?
  - Which electives do excellent (cum laude) students take?
  - Which courses significantly delay the moment of graduation?
  - Why do students drop out?
  - Can one identify groups of students having similar study behavior?
- 6. Data set 3: data on 240 customer orders in a coffee bar, recorded by the cash register. Questions:
  - Which products are frequently purchased together?
  - When do people buy a particular product?
  - Is it possible to characterize typical customer groups?
  - How can the sales of products with a higher margin be promoted?
- 7. Variables
  - A data set (sample or table) consists of instances (individuals, entities, cases, objects, or records).
  - Variables are often referred to as attributes, features, or data elements.
  - Two types:
    - categorical variables: ordinal (high-med-low, cum laude-passed-failed) or nominal (true-false, red-pink-green)
    - numerical variables (ordered, cannot be enumerated easily)
- 8. Supervised learning
  - Labeled data, i.e., there is a response variable that labels each instance.
  - Goal: explain the response variable (dependent variable) in terms of the predictor variables (independent variables).
  - Classification techniques (e.g., decision tree learning) assume a categorical response variable; the goal is to classify instances based on the predictor variables.
  - Regression techniques assume a numerical response variable; the goal is to find a function that fits the data with the least error.
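The regression idea in the last bullet can be sketched with ordinary least squares, the simplest case of fitting a function to data with the least squared error. This is a minimal illustration added here for concreteness, not a technique singled out by the slides:

```python
# Minimal sketch of a regression technique: fit y = a*x + b by
# ordinary least squares using the closed-form solution, illustrating
# how a numerical response variable is explained by a predictor.
def least_squares(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Example: noiseless data on the line y = 2x + 1
a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
```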
- 9. Unsupervised learning
  - Unsupervised learning assumes unlabeled data, i.e., the variables are not split into response and predictor variables.
  - Examples: clustering (e.g., k-means clustering and agglomerative hierarchical clustering) and pattern discovery (association rules).
- 10. Decision tree learning: data set 1. [Decision tree figure: the root splits on smoker (yes: young, 195/11); non-smokers split on drinker (yes: old, 65/2); non-drinkers split on weight (<90: young, 381/55; ≥90: old, 219/34).]
- 11. Decision tree learning: data set 2. [Decision tree figure: grade thresholds on logic (≥8), programming (≥7), linear algebra (≥6), and operations research (≥6) lead to the classes failed, passed, and cum laude.]
- 12. Decision tree learning: data set 3. [Decision tree figure: splits on the number of teas (≥1), lattes (≥2), and espressos (≥1) in an order, predicting whether the order contains a muffin.]
- 13. Basic idea
  - Split the set of instances into subsets such that the variation within each subset becomes smaller.
  - Based on the notion of entropy or a similar measure.
  - Minimize the average entropy; maximize the information gain per step.
  [Figure: for data set 1 (546 young, 314 old; E = 0.946848), splitting on attribute smoker gives an information gain of 0.107012 (overall E drops to 0.839836); then splitting the non-smokers on drinker gives a further gain of 0.076468 (overall E drops to 0.763368).]
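The entropy and information-gain figures on this slide can be reproduced with a short computation. The sketch below assumes a dict-per-instance encoding of the data, which is my choice, not the book's; the numbers it reproduces (E = 0.946848 overall, gain 0.107012 for the smoker split) are the slide's:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in average entropy achieved by splitting on one attribute."""
    n = len(labels)
    gain = entropy(labels)
    for value in {r[attribute] for r in rows}:
        subset = [l for r, l in zip(rows, labels) if r[attribute] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain
```

With 546 young and 314 old instances this yields the overall entropy 0.946848, and splitting the 195 smokers (184 young, 11 old) from the 665 non-smokers (362 young, 303 old) yields the information gain 0.107012, matching the figure.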
- 14. Clustering. [Figure: two scatter plots of age versus weight; the second groups the instances into clusters A, B, and C.]
- 15. k-means clustering. [Figure: three stages (a)-(c) showing centroids (+) being repositioned as points are reassigned to the nearest centroid.]
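The iteration pictured in stages (a)-(c) can be sketched in a few lines. This minimal version takes the initial centroids as a parameter for reproducibility; a real implementation would pick them randomly and typically restart several times:

```python
def kmeans(points, k, iters=20, init=None):
    """Plain k-means on 2-D points: repeatedly assign each point to its
    nearest centroid, then move each centroid to its cluster's mean."""
    centroids = list(init) if init else points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assignment step: nearest centroid by squared distance
            j = min(range(k),
                    key=lambda j: (p[0] - centroids[j][0]) ** 2
                                + (p[1] - centroids[j][1]) ** 2)
            clusters[j].append(p)
        # update step: each centroid moves to its cluster's mean
        centroids = [(sum(p[0] for p in c) / len(c),
                      sum(p[1] for p in c) / len(c)) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids
```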
- 16. Agglomerative hierarchical clustering. [Figure: (a) nested clusters over instances a-j, e.g. ab, cd, abcd, fg, hi, efg, hij, efghij; (b) the corresponding dendrogram.]
- 17. Levels introduced by agglomerative hierarchical clustering. [Same figure as the previous slide.] Any horizontal line in the dendrogram corresponds to a concrete clustering at a particular level of abstraction.
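The merge levels of the dendrogram can be computed directly. The sketch below uses 1-D points and single linkage for concreteness (the slides do not fix a linkage criterion); the merge distances it returns are exactly the levels at which a horizontal line produces a new clustering:

```python
def agglomerative(points):
    """Single-linkage agglomerative clustering on 1-D points.
    Start with singleton clusters and repeatedly merge the two closest
    clusters; the returned merge distances are the dendrogram levels."""
    clusters = [[p] for p in points]
    levels = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        levels.append(d)
    return levels
```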
- 18. Association rule learning: rules of the form “IF X THEN Y”.
- 19. Special case: market basket analysis.
- 20. Example: people that order tea and latte also order muffins.
  - Support should be as high as possible (but will be low in case of many items).
  - Confidence should be close to 1.
  - High lift values suggest a positive correlation (lift is 1 if X and Y are independent).
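The three measures follow the standard definitions: support is the fraction of transactions containing both X and Y, confidence is support(X and Y) divided by support(X), and lift divides confidence by support(Y). A minimal sketch (the coffee-bar transactions used in the test below are invented for illustration, not data set 3):

```python
def rule_metrics(transactions, X, Y):
    """Support, confidence, and lift of the rule 'IF X THEN Y',
    where each transaction and both X and Y are sets of items."""
    n = len(transactions)
    n_x = sum(1 for t in transactions if X <= t)        # contain X
    n_y = sum(1 for t in transactions if Y <= t)        # contain Y
    n_xy = sum(1 for t in transactions if X | Y <= t)   # contain both
    support = n_xy / n
    confidence = n_xy / n_x
    lift = confidence / (n_y / n)
    return support, confidence, lift
```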
- 21. Brute-force algorithm.
- 22. Apriori (an optimization based on two observations).
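The key observation behind Apriori is monotonicity: support can only shrink as an itemset grows, so a k-itemset can be frequent only if all of its (k-1)-subsets are. A sketch of the resulting level-wise algorithm (the slide itself shows no code, so the structure below is a common textbook formulation, not the book's):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining in the Apriori style."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support}
    frequent = set(level)
    k = 2
    while level:
        # join step: combine frequent (k-1)-itemsets into k-candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune step: drop candidates with an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        level = {c for c in candidates if support(c) >= min_support}
        frequent |= level
        k += 1
    return frequent
```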
- 23. Sequence mining.
- 24. Episode mining (32 time windows of length 5). [Figure: an event sequence a c b e d f c b b c a e b e c d c b over time slots 10-37, with three candidate episodes E1, E2, and E3.]
- 25. Occurrences. [Figure: the same event sequence; episode E2 occurs in 16 of the 32 time windows, with example occurrences of E1, E2, and E3 marked.]
- 26. Hidden Markov models. Three classic questions:
  - Given an observation sequence, how to compute the probability of the sequence given a hidden Markov model?
  - Given an observation sequence and a hidden Markov model, how to compute the most likely “hidden path” in the model?
  - Given a set of observation sequences, how to derive the hidden Markov model that maximizes the probability of producing these sequences?
  [Figure: a three-state hidden Markov model (s1, s2, s3) annotated with transition probabilities and observation probabilities for the symbols a-e.]
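The first question is answered by the forward algorithm, which sums over all hidden paths in polynomial time instead of enumerating them. A sketch for a generic discrete HMM; the toy two-state weather model in the example is invented for illustration and is not the three-state model from the figure:

```python
def forward(obs, states, start, trans, emit):
    """Forward algorithm: probability of an observation sequence given
    an HMM, computed in O(len(obs) * |states|^2)."""
    # alpha[s] = probability of the prefix seen so far, ending in state s
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans[r][s] for r in states) * emit[s][o]
                 for s in states}
    return sum(alpha.values())

# Toy two-state model (illustrative numbers only)
states = ("rain", "sun")
start = {"rain": 0.5, "sun": 0.5}
trans = {"rain": {"rain": 0.7, "sun": 0.3},
         "sun": {"rain": 0.4, "sun": 0.6}}
emit = {"rain": {"walk": 0.1, "shop": 0.9},
        "sun": {"walk": 0.8, "shop": 0.2}}
p = forward(("walk", "shop"), states, start, trans, emit)
```

The second question (the most likely hidden path) is answered by the structurally similar Viterbi algorithm, which replaces the sum over predecessor states with a maximum.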
- 27. Relation between data mining and process mining
  - Process mining is about end-to-end processes.
  - Data mining is data-centric and not process-centric.
  - Judging the quality of data mining and process mining results: many similarities, but also some differences.
  - Clearly, process mining techniques can benefit from experiences in the data mining field.
  - Let us now focus on the quality of mining results.
- 28. Confusion matrix. For the decision tree of data set 2:

  | actual \ predicted | failed | passed | cum laude |
  |--------------------|--------|--------|-----------|
  | failed             | 178    | 22     | 0         |
  | passed             | 21     | 175    | 2         |
  | cum laude          | 1      | 3      | 18        |
- 29. Confusion matrix: metrics
  - error = (fp + fn) / N
  - accuracy = (tp + tn) / N
  - tp-rate = tp / p
  - fp-rate = fp / n
  - precision = tp / p’
  - recall = tp / p (the same as the tp-rate)
  where:
  - tp is the number of true positives, i.e., instances that are correctly classified as positive;
  - fn is the number of false negatives, i.e., instances that are predicted to be negative but should have been classified as positive;
  - fp is the number of false positives, i.e., instances that are predicted to be positive but should have been classified as negative;
  - tn is the number of true negatives, i.e., instances that are correctly classified as negative;
  - p = tp + fn and n = fp + tn are the numbers of actual positives and negatives, p’ = tp + fp is the number of predicted positives, and N = p + n.
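The metrics above can be packed into a few lines. A minimal sketch following the slide's formulas; the `binary_metrics` name and dict return type are my choices:

```python
def binary_metrics(tp, fn, fp, tn):
    """Metrics from a 2x2 confusion matrix, following the slide's
    formulas: p = tp + fn actual positives, n = fp + tn actual
    negatives, pp = tp + fp predicted positives, N = all instances."""
    p, n, pp = tp + fn, fp + tn, tp + fp
    N = p + n
    return {
        "error": (fp + fn) / N,
        "accuracy": (tp + tn) / N,
        "tp-rate (recall)": tp / p,
        "fp-rate": fp / n,
        "precision": tp / pp,
    }
```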
- 30. Example. [Figure: the decision tree for data set 1 with two confusion matrices, treating “young” as the positive class.]
  - (a) Always predicting “young”: all 546 young and all 314 old instances are classified as young.
  - (b) The decision tree: 544 of 546 young instances are classified correctly (2 as old), and 63 of 314 old instances are classified correctly (251 as young).
- 31. Cross-validation. [Figure: the data set is split into a training set and a test set; the learning algorithm builds a model from the training set, and testing the model on the test set yields a performance indicator.]
- 32. k-fold cross-validation. [Figure: the data set is split into k parts; the folds rotate, so each part serves once as the test set while the remaining k-1 parts form the training set, and the performance indicators are combined.]
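The rotation scheme can be sketched generically, parameterized by a `learn` function (training set to model) and a `test` function (model and fold to performance indicator); both parameter names are mine, not the book's:

```python
def k_fold_cross_validation(data, k, learn, test):
    """k-fold cross-validation: split the data set into k folds and
    rotate, training on k-1 folds and testing on the remaining one;
    return the average performance indicator."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        model = learn(training)
        scores.append(test(model, folds[i]))
    return sum(scores) / k
```

Setting k to the number of instances gives the leave-one-out special case.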
- 33. Occam’s Razor
  - A principle attributed to the 14th-century English logician William of Ockham.
  - The principle states that “one should not increase, beyond what is necessary, the number of entities required to explain anything”, i.e., one should look for the “simplest model” that can explain what is observed in the data set.
  - The Minimal Description Length (MDL) principle tries to operationalize Occam’s Razor. In MDL, performance is judged on the training data alone and not measured against new, unseen instances. The basic idea is that the “best” model is the one that minimizes the combined encoding of both the model and the data set.
