Lecture 16 - Advanced topics on association rules, Part III

  • Introduction to Machine Learning
    Lecture 16: Advanced Topics in Association Rules Mining
    Albert Orriols i Puig
    http://www.albertorriols.net
    aorriols@salle.url.edu
    Artificial Intelligence – Machine Learning
    Enginyeria i Arquitectura La Salle, Universitat Ramon Llull
  • Recap of Lecture 13-15
    The ideas come from market basket analysis (MBA).
    Let's go shopping!
      Customer1: milk, eggs, sugar, bread
      Customer2: milk, eggs, cereal, bread
      Customer3: eggs, sugar
    What do my customers buy? Which products are bought together?
    Aim: find associations and correlations between the different items that customers place in their shopping baskets.
  • Recap of Lecture 15
    Aim: find associations between items.
    But wait!
      There are many different diapers: Dodot, Huggies …
      There are many different beers: Heineken, Desperados, Kingfisher … in bottle/can …
    Example of an item hierarchy: Clothes splits into Outerwear and Shirts; Outerwear splits into Jackets and Ski Pants.
    Which rule do you prefer?
      diapers ⇒ beer
      Dodot diapers M ⇒ Dam beer in can
    Which will have greater support?
  • Today’s Agenda
    Continuing our journey through some advanced topics in ARM:
      Mining frequent patterns without candidate generation
      Multiple-level AR
      Sequential pattern mining
      Quantitative association rules
      Mining class association rules
      Beyond support & confidence
      Applications
  • Introduction to Seq. AR
    So far, we have seen:
      Apriori
      FP-growth
      Mining multiple-level AR
    But none of them considers the order of transactions.
    However, is the sequence important? Which came first, the hen or the egg?
    Sometimes it is really important:
      Analyzing the sequence of items bought by a customer
      Web usage mining, which searches for navigational patterns of users
  • An Example in Web Usage Mining
    Web sequence:
      < {Homepage} {Electronics} {Computers} {Laptops} {Sony Vaio} {Order Confirmation} {Return to Shopping} >
  • Definition
    Defining the problem:
      Let I = {i1, i2, …, im} be a set of items.
      Sequence: an ordered list of itemsets.
      Itemset/element: a non-empty set of items X ⊆ I.
      We denote a sequence s by 〈a1 a2 … ar〉, where ai is an itemset, which is also called an element of s.
      An element (or an itemset) of a sequence is denoted by {x1, x2, …, xk}, where xj ∈ I is an item.
      We assume without loss of generality that the items in an element of a sequence are in lexicographic order.
  • Definition
    Defining the problem:
      Size: the size of a sequence is the number of elements (or itemsets) in the sequence.
      Length: the length of a sequence is the number of items in the sequence.
      A sequence of length k is called a k-sequence.
      A sequence s1 = 〈a1 a2 … ar〉 is a subsequence of another sequence s2 = 〈b1 b2 … bv〉, or s2 is a supersequence of s1, if there exist integers 1 ≤ j1 < j2 < … < jr−1 < jr ≤ v such that a1 ⊆ bj1, a2 ⊆ bj2, …, ar ⊆ bjr. We also say that s2 contains s1.
  • Example
    Let I = {1, 2, 3, 4, 5, 6, 7, 8, 9}.
    The sequence 〈{3}{4, 5}{8}〉 is contained in (or is a subsequence of) 〈{6}{3, 7}{9}{4, 5, 8}{3, 8}〉, because {3} ⊆ {3, 7}, {4, 5} ⊆ {4, 5, 8}, and {8} ⊆ {3, 8}.
    However, 〈{3}{8}〉 is not contained in 〈{3, 8}〉, or vice versa.
    The size of the sequence 〈{3}{4, 5}{8}〉 is 3, and the length of the sequence is 4.
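
The containment test from the definition can be checked mechanically. The following is a minimal Python sketch, assuming a sequence is represented as a list of sets; the names `is_subsequence`, `size` and `length` are illustrative and do not come from the lecture. It matches elements greedily from left to right, which is enough for this kind of ordered subset matching.

```python
from typing import List, Set

Sequence = List[Set[int]]  # assumption: a sequence is an ordered list of itemsets


def is_subsequence(s1: Sequence, s2: Sequence) -> bool:
    """Return True if s1 is contained in s2.

    Each element of s1 must be a subset of some element of s2, and the
    matched elements of s2 must appear at strictly increasing positions.
    """
    j = 0
    for element in s1:
        # Advance to the first remaining element of s2 that covers `element`.
        while j < len(s2) and not element <= s2[j]:
            j += 1
        if j == len(s2):
            return False
        j += 1  # the next element of s1 must match strictly later in s2
    return True


def size(s: Sequence) -> int:
    """Number of elements (itemsets) in the sequence."""
    return len(s)


def length(s: Sequence) -> int:
    """Number of items in the sequence."""
    return sum(len(element) for element in s)


s = [{3}, {4, 5}, {8}]
t = [{6}, {3, 7}, {9}, {4, 5, 8}, {3, 8}]
print(is_subsequence(s, t))                   # True
print(is_subsequence([{3}, {8}], [{3, 8}]))   # False
print(is_subsequence([{3, 8}], [{3}, {8}]))   # False
print(size(s), length(s))                     # 3 4
```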
  • Objective
    Objective of sequential pattern mining (SPM):
      Input: a set S of input data sequences (or sequence database).
      Goal: the problem of mining sequential patterns is to find all the sequences that have a user-specified minimum support.
      Each such sequence is called a frequent sequence, or a sequential pattern.
      The support for a sequence is the fraction of total data sequences in S that contain this sequence.
  • Example
    Transactions:
      Customer ID   Transaction time   Transaction (items bought)
      1             July 20, 2005      30
      1             July 25, 2005      90
      2             July 9, 2005       10, 20
      2             July 14, 2005      30
      2             July 20, 2005      40, 60, 70
      3             July 25, 2005      30, 50, 70
      4             July 25, 2005      30
      4             July 29, 2005      40, 70
      4             August 2, 2005     90
      5             July 12, 2005      90
    Customer sequences:
      Customer ID   Customer sequence
      1             <(30) (90)>
      2             <(10 20) (30) (40 60 70)>
      3             <(30 50 70)>
      4             <(30) (40 70) (90)>
      5             <(90)>
    Sequential patterns with support > 25%:
      1-sequences: <(30)>, <(40)>, <(70)>, <(90)>
      2-sequences: <(30) (40)>, <(30) (70)>, <(30) (90)>, <(40 70)>
      3-sequences: <(30) (40 70)>
    Example borrowed from Bing Liu.
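
To make the support figures concrete, here is a small Python sketch that recomputes them for the five customer sequences of this example. The representation (lists of sets) and the helper names `contains` and `support` are assumptions made for illustration.

```python
from typing import List, Set

Sequence = List[Set[int]]

# Customer sequences from the example (one data sequence per customer).
database: List[Sequence] = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]


def contains(data_sequence: Sequence, pattern: Sequence) -> bool:
    """True if `pattern` is a subsequence of `data_sequence`."""
    j = 0
    for element in pattern:
        while j < len(data_sequence) and not element <= data_sequence[j]:
            j += 1
        if j == len(data_sequence):
            return False
        j += 1
    return True


def support(pattern: Sequence, db: List[Sequence]) -> float:
    """Fraction of data sequences in `db` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in db) / len(db)


print(support([{30}], database))            # 0.8
print(support([{30}, {40, 70}], database))  # 0.4
print(support([{30, 70}], database))        # 0.2, below the 25% threshold
```

Note that <(30 70)> is supported only by customer 3, who bought both items in the same transaction, whereas <(30) (70)> is supported by customers 2 and 4, who bought 70 after 30.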
  • GSP
    GSP closely follows Apriori, but for sequential patterns.
      If a sequence S is not frequent, then none of the super-sequences of S is frequent.
      For instance, if <ab> is infrequent, so are <acb> and <(ca)b>.
    GSP follows these steps:
      Initially, every item in the DB is a candidate of length 1.
      For each level (i.e., sequences of length k) do:
        Scan the database to collect the support count for each candidate sequence.
        Generate candidate length-(k+1) sequences from the length-k frequent sequences using Apriori.
      Repeat until no frequent sequence or no candidate can be found.
    Strength: candidate pruning by Apriori.
  • The Algorithm
    Does this remind you of Apriori?
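
The level-wise loop can be sketched in a few lines of Python. This is a simplified illustration, not GSP itself: instead of GSP's join-and-prune candidate generation, it assumes candidates are built by extending each frequent k-sequence with one frequent item, either as a new element or inside the last element. The function names and the tuple-of-tuples representation are also assumptions. By the Apriori property, every frequent (k+1)-sequence is still reached this way.

```python
from typing import Dict, List, Tuple

Pattern = Tuple[Tuple[int, ...], ...]  # elements are sorted tuples of items


def contains(data_sequence: Pattern, pattern: Pattern) -> bool:
    """True if `pattern` is a subsequence of `data_sequence`."""
    j = 0
    for element in pattern:
        while j < len(data_sequence) and not set(element) <= set(data_sequence[j]):
            j += 1
        if j == len(data_sequence):
            return False
        j += 1
    return True


def count(pattern: Pattern, db: List[Pattern]) -> int:
    """Support count: number of data sequences that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in db)


def mine_sequences(db: List[Pattern], min_count: int) -> Dict[int, List[Pattern]]:
    """Return frequent sequences grouped by length (number of items)."""
    items = sorted({x for seq in db for element in seq for x in element})
    level = [((x,),) for x in items if count(((x,),), db) >= min_count]
    frequent_items = [p[0][0] for p in level]
    result: Dict[int, List[Pattern]] = {}
    k = 1
    while level:
        result[k] = level
        candidates = set()
        for p in level:
            for x in frequent_items:
                candidates.add(p + ((x,),))  # x starts a new element
                if x > p[-1][-1]:            # x joins the last element, kept sorted
                    candidates.add(p[:-1] + (p[-1] + (x,),))
        # One pass over the database per level to count candidate support.
        level = sorted(c for c in candidates if count(c, db) >= min_count)
        k += 1
    return result


db = [
    ((30,), (90,)),
    ((10, 20), (30,), (40, 60, 70)),
    ((30, 50, 70),),
    ((30,), (40, 70), (90,)),
    ((90,),),
]
# Support > 25% of 5 sequences means a support count of at least 2.
for k, patterns in mine_sequences(db, min_count=2).items():
    print(k, patterns)
```

On the customer-sequence example, this yields the same 1-, 2- and 3-sequences listed with support > 25%.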
  • Quantitative AR
      Transaction ID   Age   Married   NumCars
      1                23    No        1
      2                25    Yes       1
      3                29    No        0
      4                34    Yes       2
      5                38    Yes       2
    <Age: 30..39> and <Married: Yes> => <NumCars: 2>   (support = 40%, confidence = 100%)
    How can we deal with these data?
  • Map to Boolean Values
      Record ID   Age [20..29]   Age [30..39]   Married: Yes   Married: No   NumCars: 0   NumCars: 1
      100         1              0              0              1             0            1
      200         1              0              1              0             0            1
      300         1              0              0              1             1            0
      400         0              1              1              0             0            0
      500         0              1              1              0             0            0
    Now, use any system for mining boolean AR:
      Apriori
      FP-growth
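
As a rough illustration of this mapping, the Python sketch below turns each record of the quantitative table into a set of boolean items and recomputes the support and confidence of the rule <Age: 30..39> and <Married: Yes> => <NumCars: 2>. The ten-year age bins follow the intervals shown on the slide; the item naming scheme and the helper names are assumptions.

```python
from typing import Dict, List, Set

records: List[Dict[str, object]] = [
    {"Age": 23, "Married": "No", "NumCars": 1},
    {"Age": 25, "Married": "Yes", "NumCars": 1},
    {"Age": 29, "Married": "No", "NumCars": 0},
    {"Age": 34, "Married": "Yes", "NumCars": 2},
    {"Age": 38, "Married": "Yes", "NumCars": 2},
]


def to_items(record: Dict[str, object]) -> Set[str]:
    """Map one record to boolean items: one item per (attribute, value or interval)."""
    low = (int(record["Age"]) // 10) * 10  # assumed bins: [20..29], [30..39], ...
    return {
        f"Age:{low}..{low + 9}",
        f"Married:{record['Married']}",
        f"NumCars:{record['NumCars']}",
    }


transactions = [to_items(r) for r in records]


def support(itemset: Set[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent: Set[str], consequent: Set[str]) -> float:
    return support(antecedent | consequent) / support(antecedent)


X = {"Age:30..39", "Married:Yes"}
Y = {"NumCars:2"}
print(support(X | Y))    # 0.4
print(confidence(X, Y))  # 1.0
```

Once the records are in this form, any boolean association rule miner can be applied to `transactions`.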
  • Problems with this Approach
    MinSup: if the number of intervals is large, the support of any single interval can be low, so it may fall below the minimum support.
    MinConf: information is lost when values are partitioned into intervals. Confidence can be lower when the number of intervals is smaller.
    Example:
      In the partition used: <NumCars: 0> ⇒ <Married: No>, c = 100%
      But now assume that, in the partition, NumCars: 0 and NumCars: 1 go to the same interval:
      <NumCars: 0,1> ⇒ <Married: No>, c = 66.67%
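
The confidence drop in this example can be verified directly on the five records of the quantitative table. The sketch below assumes records are reduced to (NumCars, Married) pairs and that an interval is given as a set of NumCars values; both choices are illustrative.

```python
from typing import List, Set, Tuple

# (NumCars, Married) for the five records of the quantitative table.
records: List[Tuple[int, str]] = [(1, "No"), (1, "Yes"), (0, "No"), (2, "Yes"), (2, "Yes")]


def confidence(num_cars_interval: Set[int], married: str) -> float:
    """Confidence of <NumCars in interval> => <Married: married>."""
    covered = [m for cars, m in records if cars in num_cars_interval]
    return sum(m == married for m in covered) / len(covered)


fine = confidence({0}, "No")       # one interval per value
coarse = confidence({0, 1}, "No")  # 0 and 1 merged into one interval
print(f"{fine:.2%}")    # 100.00%
print(f"{coarse:.2%}")  # 66.67%
```

Merging the two values raises the support of the antecedent from one record to three, but only two of those three records are unmarried, which is where the confidence is lost.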
  • Problems with this Approach
    How can we solve this problem?
      Increase the number of intervals (to reduce the information lost) while combining adjacent ones (to increase support).
    This brings two new problems:
      ExecTime: execution time blows up as the number of items per record increases.
      ManyRules: the number of rules also blows up, and many of them will not be interesting.
  • Second Approach
    Other solutions?
      Well, the problem was that the intervals were not the best ones.
      Let’s try to create the best intervals for our data.
      How? Discretizing/clustering techniques:
        Apply a discretizing/clustering technique to find the best partitions.
        Employ those partitions.
    We’ll see how clustering techniques work in the next class. So, keep this in mind and put the pieces together next class!
  • Third Approach
    And what if we do not map the input to a boolean space?
      Create interval-based association rules directly.
      So, decide the best interval and then count the support.
      Usually, these approaches do not provide all the association rules, but only the ones with larger support and confidence.
    Fuzzy logic can also be applied here. But again, we’ll see GFS in two or three lectures.
  • Mining Class Association Rules
    So far, we have seen ARM without any specific target.
      It finds all possible rules that exist in the data, i.e., any item can appear as a consequent or a condition of a rule.
    However, what if we are interested in some specific targets?
      E.g.: the user has a set of text documents from some known topics and wants to find out which words are associated or correlated with each topic.
    So, now, we want to find X ⇒ y, where X ⊆ I and y ∈ Y.
    The algorithms are very similar to those of ARM.
    We are not going to see them in class, but you have information on the eStudy.
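
A toy Python sketch of rules of the form X ⇒ y follows. The four documents, their topic labels and the function `class_rules` are invented purely for illustration. Rule bodies are enumerated by brute force up to a small size rather than with an Apriori-style search, which is only reasonable for very small inputs.

```python
from collections import Counter
from itertools import combinations
from typing import List, Set, Tuple

# Hypothetical labelled documents: (set of words, topic).
docs: List[Tuple[Set[str], str]] = [
    ({"goal", "team", "match"}, "sports"),
    ({"team", "coach", "goal"}, "sports"),
    ({"vote", "party", "election"}, "politics"),
    ({"party", "vote", "team"}, "politics"),
]


def class_rules(docs, min_support: float, min_confidence: float, max_size: int = 2):
    """Return rules (body, label, support, confidence) with the class as consequent."""
    n = len(docs)
    body_count: Counter = Counter()
    rule_count: Counter = Counter()
    for words, label in docs:
        for k in range(1, max_size + 1):
            for body in combinations(sorted(words), k):
                body_count[body] += 1
                rule_count[(body, label)] += 1
    rules = []
    for (body, label), c in rule_count.items():
        sup = c / n                 # support of body together with the class
        conf = c / body_count[body]  # confidence of body => class
        if sup >= min_support and conf >= min_confidence:
            rules.append((body, label, sup, conf))
    return sorted(rules)


for body, label, sup, conf in class_rules(docs, min_support=0.5, min_confidence=1.0):
    print(set(body), "=>", label, f"(support={sup:.0%}, confidence={conf:.0%})")
```

In this toy data, "team" alone does not yield a rule at 100% confidence because it also appears in a politics document, while "goal" and "vote" do.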
  • Beyond Support and Confidence
    Support and confidence are the basic measures of interestingness.
    But many more have been proposed during the last few years.
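
As one example of such a measure, lift compares the observed support of X and Y together with what would be expected if X and Y were independent: lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) · supp(Y)). The slide does not name specific measures, so the choice of lift here is an assumption made for illustration. The sketch evaluates it on the five-record quantitative table for the rule <Age: 30..39> and <Married: Yes> => <NumCars: 2>, working with counts to avoid rounding issues.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]

records: List[Record] = [
    {"Age": 23, "Married": "No", "NumCars": 1},
    {"Age": 25, "Married": "Yes", "NumCars": 1},
    {"Age": 29, "Married": "No", "NumCars": 0},
    {"Age": 34, "Married": "Yes", "NumCars": 2},
    {"Age": 38, "Married": "Yes", "NumCars": 2},
]


def lift(x: Callable[[Record], bool], y: Callable[[Record], bool]) -> float:
    """lift = supp(X and Y) / (supp(X) * supp(Y)), computed from counts."""
    n = len(records)
    n_x = sum(x(r) for r in records)
    n_y = sum(y(r) for r in records)
    n_xy = sum(x(r) and y(r) for r in records)
    return n * n_xy / (n_x * n_y)


def antecedent(r: Record) -> bool:
    return 30 <= int(r["Age"]) <= 39 and r["Married"] == "Yes"


def consequent(r: Record) -> bool:
    return r["NumCars"] == 2


print(lift(antecedent, consequent))  # 2.5
```

A lift above 1 indicates that the antecedent and consequent occur together more often than independence would predict; a value of 1 indicates no association.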
  • Some Applications
    Wal-Mart has used the technique for years to mine POS data and arrange their stores to maximize sales from such analysis.
    Medical databases, to discover commonly occurring diseases amongst groups of people.
    Lottery results databases, to discover those lucky combinations of numbers.
  • Some Applications
    Power System Restoration (PSR)
      PSR is a multi-objective, multi-period, nonlinear, mixed-integer optimization problem with various constraints and unforeseeable factors.
      Discovery of associations that help build heuristics for PSR.
      Actions in a PSR:
        start_black_start_unit(x)
        energize_line(x)
        pick_up_load(x)
        synchronize(x,y)
        connect_tie_line(x)
        crank_unit(x)
        energize_busbar(x)
  • Some Applications
    Correlations with color, spatial relationships, etc.
    From coarse to fine resolution mining.
  • Next Class
    Clustering