Decision Tree Learning
Supervised Learning
Adrian Cuyugan
Information Analytics
Multidisciplinary Subject
Statistics
Data Mining
Machine Learning
AI
Text Mining
Business Process Mining
Natural Language Processing
Database Management
Library Science
Mathematics
Computer Science
Machine Learning
Supervised vs Unsupervised Learning
• Supervised learning assumes labeled data, i.e. there is a response
variable that labels each record.
• Unsupervised learning, on the other hand, does not require a response
variable because the algorithm learns from the distinct patterns within
the data. Examples are clustering and pattern discovery.
Supervised Learning Techniques
• Regression techniques assume a numerical response variable. The most
frequently used is linear regression, which minimizes the sum of squared errors.
• Classification techniques assume a categorical response variable. The
foundation of classification techniques is the decision tree algorithm.
Entropy
In other words, the algorithm splits the set of instances into subsets
such that the variation within each subset becomes smaller.
Entropy is an information-theoretic measure of the uncertainty in a
multi-set of elements.
If the multi-set contains many different elements and each element is
unique, then variation is maximal and it takes many bits to encode the
individual elements. Hence, the entropy is high.
If, on the other hand, all elements are the same, then no bits are needed
to encode the individual elements; hence the entropy is low.
[Figure: a decision node splitting the data on Y and N]
Entropy
[Figure: the same data split three ways; the resulting subsets are labeled High Entropy or Low Entropy]
Entropy
E = -\sum_{i=1}^{k} \frac{c_i}{n} \log_2 \frac{c_i}{n}

E_1 = -\left( \frac{3}{6} \log_2 \frac{3}{6} + \frac{3}{6} \log_2 \frac{3}{6} \right)
E_1 = -\left( 0.5 \cdot (-1) + 0.5 \cdot (-1) \right)
E_1 = 1
Entropy
E = -\sum_{i=1}^{k} \frac{c_i}{n} \log_2 \frac{c_i}{n}

E_2 = -\left( \frac{2}{2} \log_2 \frac{2}{2} + 0 \right) = -(1 \cdot 0) = 0

E_3 = -\left( \frac{1}{4} \log_2 \frac{1}{4} + \frac{3}{4} \log_2 \frac{3}{4} \right)
E_3 = -\left( 0.25 \cdot (-2) + 0.75 \cdot (-0.415) \right) = 0.811
Entropy
E = -\sum_{i=1}^{k} \frac{c_i}{n} \log_2 \frac{c_i}{n}

E_2 = -\left( \frac{2}{2} \log_2 \frac{2}{2} + 0 \right) = 0

E_{3.1} = -\left( \frac{1}{1} \log_2 \frac{1}{1} + 0 \right) = 0

E_{3.2} = -\left( 0 + \frac{3}{3} \log_2 \frac{3}{3} \right) = 0
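These entropy values are easy to verify in code. Below is a minimal Python sketch (the entropy helper and the class-count lists are illustrative, not part of the original slides):

import math

def entropy(counts):
    """Entropy of a node given the class counts in it."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(entropy([3, 3]))  # E1 = 1.0   (perfectly mixed node)
print(entropy([2, 0]))  # E2 = 0.0   (pure node)
print(entropy([1, 3]))  # E3 ~ 0.811 (mostly one class)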
Weighted Average Entropy
E_w = \sum_{i,j=1}^{k} \frac{c_{ij}}{n} \, \theta
\quad (\theta \text{ is the entropy of each subset})

E_{w1} = \frac{6}{6} \cdot 1 = 1
Weighted Average Entropy
E_2 = -\left( \frac{2}{2} \log_2 \frac{2}{2} + 0 \right) = 0

E_3 = -\left( \frac{1}{4} \log_2 \frac{1}{4} + \frac{3}{4} \log_2 \frac{3}{4} \right)
E_3 = -\left( 0.25 \cdot (-2) + 0.75 \cdot (-0.415) \right) = 0.811

E_{w2} = \frac{2}{6} \cdot 0 + \frac{4}{6} \cdot 0.811 = 0.54

E_w = \sum_{i,j=1}^{k} \frac{c_{ij}}{n} \, \theta
Weighted Average Entropy
E_2 = -\left( \frac{2}{2} \log_2 \frac{2}{2} + 0 \right) = 0

E_{3.1} = -\left( \frac{1}{1} \log_2 \frac{1}{1} + 0 \right) = 0

E_{3.2} = -\left( 0 + \frac{3}{3} \log_2 \frac{3}{3} \right) = 0

E_{w3} = \frac{2}{6} \cdot 0 + \frac{1}{6} \cdot 0 + \frac{3}{6} \cdot 0 = 0

E_w = \sum_{i,j=1}^{k} \frac{c_{ij}}{n} \, \theta
Information Gain
[Figure: the three candidate splits from the previous slides]

IG = E_\mu(T) - E_\mu(T, a)

E_{\mu 1} = 1, \quad E_{\mu 2} = 0.54, \quad E_{\mu 3} = 0

IG_{1,2} = 1 - 0.54 = 0.46
IG_{2,3} = 0.54 - 0 = 0.54 \quad \text{Stop!}
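The weighted average entropies and information gains can be checked the same way. A small Python sketch, reusing the illustrative entropy helper from the previous sketch (the split definitions mirror the worked example):

import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def weighted_entropy(subsets):
    """Average subset entropy weighted by subset size (E_w)."""
    n = sum(sum(s) for s in subsets)
    return sum(sum(s) / n * entropy(s) for s in subsets)

e_w1 = weighted_entropy([[3, 3]])                   # no split: 1.0
e_w2 = weighted_entropy([[2, 0], [1, 3]])           # first split: ~0.54
e_w3 = weighted_entropy([[2, 0], [1, 0], [0, 3]])   # second split: 0.0

print(e_w1 - e_w2)  # information gain of the first split,  ~0.46
print(e_w2 - e_w3)  # information gain of the second split, ~0.54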
Different Variations
Additional Settings
• Minimal size of the nodes
• Maximum depth of the tree
• Bootstrapping at nodes
• Setting a minimal threshold for IG
• Using the Gini index instead of information gain
• Post-pruning of the tree
Different Algorithms
• ID3 (Iterative Dichotomiser 3)
First decision tree classifier.
• CART (Classification and Regression Trees)
A binary classifier; the generic decision tree learning algorithm
used in the example.
• C4.5 and C5.0
Can handle numerical independent variables. The latter offers
more computational speed and differs in its splitting rule.
• CHAID (Chi-square Automatic Interaction Detector)
Uses significance testing for splitting.
• Ensembles, e.g. Random Forest, AdaBoost, Gradient Boosting
Use bagging, bootstrapping and weighting. Very flexible and the
most recent innovations in decision tree learning.
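Many of these settings map directly onto options of common decision tree implementations. A minimal sketch using scikit-learn (assuming the library is installed; the toy dataset mirrors the six-record Y/N example but is otherwise illustrative):

from sklearn.tree import DecisionTreeClassifier, export_text

# Six records with one binary feature and a Y/N label,
# matching the 3-Y / 3-N split used in the entropy example.
X = [[0], [0], [1], [1], [1], [1]]
y = ["Y", "Y", "Y", "N", "N", "N"]

tree = DecisionTreeClassifier(
    criterion="entropy",   # or "gini" (the Gini index)
    max_depth=3,           # maximum depth of the tree
    min_samples_leaf=1,    # minimal size of the nodes
)
tree.fit(X, y)
print(export_text(tree))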
Suggested Topics to Read
1. Dividing datasets for model evaluation
a) Training and testing sets
b) Cross-validation
2. Confusion matrix for binary classifiers
a) True Positive and True Negative
b) False Positive and False Negative
3. Quality measures in evaluating classification models
a) Error and Accuracy
b) Precision and Recall
c) F1 score (harmonic mean)
d) ROC Chart
e) Area Under the Curve
4. Ensemble methods
5. Bootstrapping and resampling statistics
Text Mining and Analytics
Unsupervised Learning
Adrian Cuyugan
Information Analytics
Text Mining Overview
Data Extraction
• File types and sources (spreadsheets, Word documents, HTML, JSON, APIs, etc.)
• Regular expressions
• Data file systems (RDBMS, Google File System, Hadoop, MapReduce)
Information Retrieval
• Intro to Natural Language Analysis
• Vector Space Model – Bag of Words
• Term Frequency Matrix
• Inverse Document Frequency Matrix
• TF-IDF Matrix
• Stop words and stemming
• Document length normalization (PL2, Okapi/BM25)
• Evaluation (Average Precision, Reciprocal Rank, F-measure and nDCG)
• Query likelihood, statistical language probability, unigram language model
• Rocchio feedback and KL divergence
• Recommender systems
Pattern Analysis
• Pattern discovery concepts (frequent, closed and max)
• Association rules
• Quantitative measures (support, confidence and lift)
• Other measures
• Apriori, ECLAT and FP-Growth algorithms
• Multi-level and multi-dimensional levels, compressed and colossal patterns
• Sequential patterns
• Graph patterns
• Topic modelling for text data
Clustering
• Partitioning, hierarchical and density-based methods
• Spectral clustering
• Probabilistic models and the EM algorithm
• Evaluating clustering models
• Clustering streaming data
• Graph theory
• Social network analysis
Analytics
• Text clustering, categorization and summarization
• Topic-based modelling
• Sentiment analysis
• Integration of free-form text and structured data
Visualization
• Basic charts and graphs
• Animation and interactivity
• Visualizing relationships (hierarchies, clusters and networks)
• Visualizing text
Text Retrieval
Text Mining and Analytics
Natural Language Analysis
The quick brown fox jumped over the lazy dog.

Lexical Analysis (part-of-speech tagging):
article adjective adjective noun verb preposition article adjective noun

Syntactic Analysis (parsing):
Noun phrase / Prepositional phrase; Subject and Predicate

Semantic Analysis:
fox(f1), dog(d1), jump(f1, d1)

Pragmatic Analysis:
How quick was the fox that it jumped over the dog?
Could the dog have escaped the quick fox if it wasn’t lazy?
Why did the fox jump over the dog?
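For the lexical-analysis layer, a part-of-speech tagger produces a tag sequence like the one above. A minimal sketch using NLTK (assuming the library and its tokenizer and tagger models are downloaded; it emits Penn Treebank tags rather than the slide's simplified labels):

import nltk
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")  # one-time model downloads

sentence = "The quick brown fox jumped over the lazy dog."
print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumped', 'VBD'), ...]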
Vector Space Model
Document (d): The quick brown fox jumped over the lazy dog.
Query (q): How many times does “dog” occur in the document?
Term frequency (tf): count of a query term in a document.
  Example: count(“dog”, d)
Document length |d|: how long is the document?
Document frequency (df): how often do we see “dog” in the entire collection?
  Example: df(“dog”) = p(“dog” | collection)
Simplest VSM Bag of Words
VSM(q, d) = q \cdot d = x_1 y_1 + \dots + x_n y_n = \sum_{i=1}^{n} x_i y_i

x_i, y_i \in \{0, 1\}
1 = word is present, 0 = word is absent

q: quick fox over dog
d1: … The quick brown …
d2: The quick brown and over cunny fox …
d3: … the fox is brown and quick …
d4: The quick brown fox … fox … over …
d5: The quick fox … the … the … over brown fox … fox

How would you rank the documents based on bit-vector term frequency?
Bit-Vector Term Frequency Matrix
q: quick fox over dog
d1: … The quick brown …
d2: The quick brown and over cunny fox …
d3: … the fox is brown and quick …
d4: The quick brown fox … fox … over …
d5: The quick fox … the … the … over brown fox … fox

     the  quick  brown  and  is  cunny  fox  over  dog
q     0     1      0     0    0    0     1     1    1
d1    0     1      0     0    0    0     0     0    0
d2    0     1      0     0    0    0     1     1    0
d3    0     1      0     0    0    0     1     0    0
d4    0     1      0     0    0    0     1     1    0
d5    0     1      0     0    0    0     1     1    0

x_i, y_i \in \{0, 1\}: 1 = word is present, 0 = word is absent

f(q, d1) = 0·0 + 1·1 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 + 0·0 = 1
f(q, d3) = 0·0 + 1·1 + 0·0 + 0·0 + 0·0 + 0·0 + 1·1 + 0·0 + 0·0 = 2
f(q, d5) = 0·0 + 1·1 + 0·0 + 0·0 + 0·0 + 0·0 + 1·1 + 1·1 + 0·0 = 3
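The f(q, d) scores above are dot products of 0/1 vectors. A minimal Python sketch (the vocabulary and document term sets are transcribed from the example; the variable and function names are illustrative):

vocab = ["the", "quick", "brown", "and", "is", "cunny", "fox", "over", "dog"]

query = {"quick", "fox", "over", "dog"}
docs = {
    "d1": {"the", "quick", "brown"},
    "d3": {"the", "fox", "is", "brown", "and", "quick"},
    "d5": {"the", "quick", "fox", "over", "brown"},
}

def bit_vector(words):
    return [1 if w in words else 0 for w in vocab]

q_vec = bit_vector(query)
for name, words in docs.items():
    d_vec = bit_vector(words)
    score = sum(x * y for x, y in zip(q_vec, d_vec))
    print(name, score)   # d1 -> 1, d3 -> 2, d5 -> 3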
Raw Term Frequency Matrix
q: quick fox over dog
d1: … The quick brown …
d2: The quick brown and over cunny fox …
d3: … the fox is brown and quick …
d4: The quick brown fox … fox … over …
d5: The quick fox … the … the … over brown fox … fox

     the  quick  brown  and  is  cunny  fox  over  dog   f(q, dn)
q     0     1      0     0    0    0     1     1    1
d1    0     1      0     0    0    0     0     0    0        1
d2    0     1      0     0    0    0     1     1    0        3
d3    0     1      0     0    0    0     1     0    0        2
d4    0     1      0     0    0    0     2     1    0        4
d5    0     1      0     0    0    0     3     1    0        5

x_i, y_i \in [0, +\infty): the raw count of the term, 0 = word is absent
Limitation of Term Frequency
q: quick fox over dog
d1: … The quick brown …
d2: The quick brown and over cunny fox …
d3: … the fox is brown and quick …
d4: The quick brown fox … fox … over …
d5: The quick fox … the … the … over brown fox … fox

• fox deserves more credit in the matrix.
• fox should be treated as more important than over.
TF Weighting Matrix
q: quick fox over dog
d1: … The quick brown …
d2: The quick brown and over cunny fox …
d3: … the fox is brown and quick …
d4: The quick brown fox … fox … over …
d5: The quick fox … the … the … over brown fox … fox

     the  quick  brown  and  is  cunny  fox  over  dog   f(q, dn)
q     0     1      0     0    0    0     1     1    1
w     0    2.0     0     0    0    0    5.0   1.0  5.0
d1    0     1      0     0    0    0     0     0    0       2.0
d2    0     1      0     0    0    0     1     1    0       8.0
d3    0     1      0     0    0    0     1     0    0       7.0
d4    0     1      0     0    0    0     2     1    0      13.0
d5    0     1      0     0    0    0     3     1    0      18.0

x_i \in [0, +\infty): the term frequency, 0 = word is absent
y_i \in [0, +\infty): the weight of the term, 0 = word is absent
Inverse Document Frequency w/ Smoothing
IDF(w) = \log \frac{M + 1}{k}

M = total number of docs in the collection
k = document frequency of the word

[Figure: IDF as a function of document frequency k, for a fixed collection size M]
Term Frequency-Inverse Document Frequency
      the   quick  brown   and    is   cunny   fox   over   dog
M      5      5      5      5      5      5     5      5     5
k      5      5      5      1      1      1     4      3     0
IDF  0.08   0.08   0.08   0.78   0.78   0.78  0.18   0.30  0.00

q: quick fox over dog
d1: … The quick brown …
d2: The quick brown and over cunny fox …
d3: … the fox is brown and quick …
d4: The quick brown fox … fox … over …
d5: The quick fox … the … the … over brown fox … fox

      the     quick   brown   and     is      cunny   fox     over    dog    f(q, dn)
d1   0.000   0.079   0.000   0.000   0.000   0.000   0.000   0.000   0.000    0.08
d2   0.000   0.079   0.000   0.000   0.000   0.000   0.176   0.301   0.000    0.56
d3   0.000   0.079   0.000   0.000   0.000   0.000   0.176   0.000   0.000    0.26
d4   0.000   0.079   0.000   0.000   0.000   0.000   0.352   0.301   0.000    0.73
d5   0.000   0.079   0.000   0.000   0.000   0.000   0.528   0.301   0.000    0.91

IDF = \log \frac{M + 1}{k}
\qquad
TFIDF = \sum_{i=1}^{n} x_i y_i \log \frac{M + 1}{k_i}
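The IDF row and the per-document TF-IDF scores can be reproduced in a few lines. A rough sketch that assumes base-10 logarithms and the smoothed formula log((M+1)/k), and scores only the query terms, matching the table:

import math

M = 5  # documents in the collection
df = {"quick": 5, "fox": 4, "over": 3}   # document frequency k of each query term

# raw term frequency of the query terms in each document
tf = {
    "d1": {"quick": 1},
    "d2": {"quick": 1, "fox": 1, "over": 1},
    "d3": {"quick": 1, "fox": 1},
    "d4": {"quick": 1, "fox": 2, "over": 1},
    "d5": {"quick": 1, "fox": 3, "over": 1},
}

idf = {w: math.log10((M + 1) / k) for w, k in df.items()}
for doc, counts in tf.items():
    score = sum(c * idf[w] for w, c in counts.items())
    print(doc, round(score, 2))   # d1 0.08, d2 0.56, d3 0.26, d4 0.73, d5 0.91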
Comparing Matrices
q: quick fox over dog
d1: … The quick brown …
d2: The quick brown and over cunny fox …
d3: … the fox is brown and quick …
d4: The quick brown fox … fox … over …
d5: The quick fox … the … the … over brown fox … fox

     Bit-Vector TF   Term Frequency   TF Weighting   TF-IDF
d1        1                1               2.0        0.08
d2        3                3               8.0        0.56
d3        2                2               7.0        0.26
d4        3                4              13.0        0.73
d5        3                5              18.0        0.91
Stop Words
Pronouns
• First person: I, me, myself; we, us, ourselves
• Second person: you, yours, yourself, yourselves
• Third person: he, him, his, himself; she, her, hers, herself; it, its, itself; they, them, themselves
• Interrogatives and demonstratives: what, which, who, whom; this, that, those, these
Verbs
• Be: am, is, are, were; be, been, being
• Have: have, has, had, having
• Do: do, does, did, doing
• Auxiliary: will, would, shall, should, can, could; may, might, must, ought
Compounds
• Pronoun + verb: I’m, you’re, she’s, they’d, we’ll
• Verb + negation: isn’t, aren’t, haven’t, doesn’t, didn’t
• Auxiliary + negation: won’t, wouldn’t, can’t, cannot, mustn’t; daren’t, oughtn’t
• Miscellaneous: let’s, there’s, how’s, what’s, here’s
Other
• Articles / determiners: a, an, the
• Conjunctions: for, and, nor, but, or, yet, so
• Prepositions: in, under, towards, before
• Common: get, go, whether, like, however, also
Stemming
Original: Such an analysis can reveal features that are not easily visible
from the variations in the individual genes and can lead to a picture of
expression that is more biologically transparent and accessible to
interpretation

Lovins: such an analys can reve featur ar not eas vis from th vari in the
individu gen and can lead to a pictur of expres that is mor biolog transpar
and access to interpres

Paice: such an analys can rev feat that are not easy vis from the vary in
the invdivid gen and can lead to a pict of express that is mor biolog
transp and access to interpret

Porter: such an analysi can reveal featur that ar not easili visibl from
the variat in the individ gene and can lead to a pictur of express that is
more biolog transpar and access to interpret
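NLTK ships a Porter stemmer, so the last row of the table can be approximated directly. A sketch assuming NLTK is installed; the exact output may differ slightly from the table, which may have been produced with different settings:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
sentence = ("Such an analysis can reveal features that are not easily visible "
            "from the variations in the individual genes")
print(" ".join(stemmer.stem(w) for w in sentence.lower().split()))
# e.g. "such an analysi can reveal featur that are not easili visibl ..."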
Association Rules
Text Mining and Analytics
Pattern Discovery
What is Pattern Discovery?
• A pattern is a set of items, subsequences, or substructures that occur
frequently together (or are strongly correlated) in a data set.
• Patterns represent intrinsic and important properties of data sets.
• Pattern discovery – uncovers patterns from massive data sets.
Why do Pattern Discovery?
• Foundation for many essential data mining tasks
• Association, correlation, and causality analysis
• Mining sequential, structural (e.g., sub-graph) patterns
• Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
• Classification: Discriminative pattern-based analysis
• Cluster analysis: Pattern-based subspace clustering
Pattern Discovery
Motivating Uses
• Which products are often purchased together?
• What are the subsequent purchases after buying an iPhone?
• Which software scripts likely contain copy-and-paste expressions?
• Which word sequences likely form phrases in the corpus?
Applications
• Market basket analysis, cross-marketing, sales campaign analysis,
Web log analysis, biochemistry sequence analysis.
Frequent Itemsets
ID   Product Names
10   Outlook, SAP, Active Directory
20   Outlook, Desktop, Active Directory
30   Outlook, Active Directory, Sharepoint
40   SAP, Sharepoint, Voicemail
50   SAP, Desktop, Active Directory, Sharepoint, Voicemail

Itemset – a set of one or more items
k-itemset – X = {x_1, …, x_k}
Absolute support – the frequency of occurrences of an itemset X.
Relative support – the fraction of transactions that contain X.
An itemset is frequent if the support of X is no less than the minsup
threshold, denoted \sigma.

Let \sigma = 50%
Frequent 1-itemsets:
Outlook: 3 (60%)
SAP: 3 (60%)
Active Directory: 4 (80%)
Sharepoint: 3 (60%)
Frequent 2-itemsets:
{Outlook, Active Directory}: 3 (60%)
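A small Python sketch to verify these supports (the transactions are transcribed from the table; the support helper is illustrative):

transactions = [
    {"Outlook", "SAP", "Active Directory"},
    {"Outlook", "Desktop", "Active Directory"},
    {"Outlook", "Active Directory", "Sharepoint"},
    {"SAP", "Sharepoint", "Voicemail"},
    {"SAP", "Desktop", "Active Directory", "Sharepoint", "Voicemail"},
]

def support(itemset):
    """Absolute and relative support of an itemset."""
    count = sum(1 for t in transactions if itemset <= t)
    return count, count / len(transactions)

print(support({"Outlook"}))                      # (3, 0.6)
print(support({"Active Directory"}))             # (4, 0.8)
print(support({"Outlook", "Active Directory"}))  # (3, 0.6)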
Association Rules
ID   Product Names
10   Outlook, SAP, Active Directory
20   Outlook, Desktop, Active Directory
30   Outlook, Active Directory, Sharepoint
40   SAP, Sharepoint, Voicemail
50   SAP, Desktop, Active Directory, Sharepoint, Voicemail

[Figure: Venn diagram of Outlook and Active Directory]
{Outlook} ∪ {Active Directory} = {Outlook, Active Directory}

X ⇒ Y denotes "if X then Y".

Support (s) – the probability that a transaction contains X ∪ Y.

Confidence (c) – the conditional probability that a transaction containing
X also contains Y:

c(X \Rightarrow Y) = \frac{s(X \cup Y)}{s(X)} = P(Y \mid X) = \frac{P(Y \cap X)}{P(X)}
Association Rule Mining
ID   Product Names
10   Outlook, SAP, Active Directory
20   Outlook, Desktop, Active Directory
30   Outlook, Active Directory, Sharepoint
40   SAP, Sharepoint, Voicemail
50   SAP, Desktop, Active Directory, Sharepoint, Voicemail

Frequent itemset mining – finding itemsets that meet the \sigma threshold.
Association rule mining – finding all rules X ⇒ Y that meet both the
support and confidence thresholds.

Frequent itemsets (minsup = 50%):
1-itemsets: Outlook: 3 (60%), SAP: 3 (60%), Active Directory: 4 (80%), Sharepoint: 3 (60%)
2-itemsets: {Outlook, Active Directory}: 3 (60%)

Association rules (minsup = 50% and minconf = 50%):
Outlook ⇒ Active Directory: (60%, 100%)
Active Directory ⇒ Outlook: (60%, 75%)
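Confidence for the two rules follows from the same transaction list. A self-contained sketch (the helper names are illustrative and repeat the support definition from the previous sketch):

transactions = [
    {"Outlook", "SAP", "Active Directory"},
    {"Outlook", "Desktop", "Active Directory"},
    {"Outlook", "Active Directory", "Sharepoint"},
    {"SAP", "Sharepoint", "Voicemail"},
    {"SAP", "Desktop", "Active Directory", "Sharepoint", "Voicemail"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    # c(X => Y) = s(X u Y) / s(X)
    return support(lhs | rhs) / support(lhs)

print(confidence({"Outlook"}, {"Active Directory"}))  # 1.0  -> 100%
print(confidence({"Active Directory"}, {"Outlook"}))  # 0.75 -> 75%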
Downward Closure of Frequent Patterns
Scenario:
• A database contains two transactions: T1 = {a1, …, a50}; T2 = {a1, …, a100}.
• We get a frequent itemset: {a1, …, a50}.
• All of its subsets are also frequent: {a1}, {a2, …, a50}, …, {a1, a2}, …, {a1, …, a49}, …
• That is 2^50 − 1 ≈ 1.1 × 10^15 sub-patterns.
Efficient mining:
• If {Outlook, SAP, Active Directory} is frequent, so is {Outlook, Active Directory},
because every transaction containing {Outlook, SAP, Active Directory} also contains
{Outlook, Active Directory}.
• Any subset of a frequent itemset must be frequent.
• Therefore, if any subset of an itemset S is infrequent, there is no chance for S to be
frequent.
Limitation of Support-Confidence Framework
Scenario:
• {Active Directory} ⇒ {Password Reset}: (s, c) = (40%, 67%)
• {¬Active Directory} ⇒ {Password Reset}: (s, c) = (35%, 88%)

                   Active Directory   ¬Active Directory   Sum of Rows
Password Reset          400                 350               750
¬Password Reset         200                  50               250
Sum of Columns          600                 400              1000
Lift
                   Active Directory   ¬Active Directory   Sum of Rows
Password Reset          400                 350               750
¬Password Reset         200                  50               250
Sum of Columns          600                 400              1000

lift(X, Y) = \frac{c(X \Rightarrow Y)}{s(Y)} = \frac{s(X \cup Y)}{s(X) \, s(Y)}

lift(X, Y) = 1: independent
lift(X, Y) > 1: positively correlated
lift(X, Y) < 1: negatively correlated

lift(\text{Active Directory}, \text{Password Reset}) = \frac{400/1000}{(600/1000)(750/1000)} = 0.89

lift(\neg\text{Active Directory}, \text{Password Reset}) = \frac{350/1000}{(400/1000)(750/1000)} = 1.16
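Both lift values follow directly from the contingency table (a short sketch; the probabilities are read off the table above):

total = 1000
s_ad, s_not_ad, s_pr = 600 / total, 400 / total, 750 / total

lift_ad = (400 / total) / (s_ad * s_pr)          # ~0.89, negatively correlated
lift_not_ad = (350 / total) / (s_not_ad * s_pr)  # ~1.17, positively correlated
print(round(lift_ad, 2), round(lift_not_ad, 2))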
Expected Value for Chi-Square
                   Active Directory   ¬Active Directory   Sum of Rows
Password Reset          400                 350               750
¬Password Reset         200                  50               250
Sum of Columns          600                 400              1000

\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}

\chi^2 = 0: independent
\chi^2 > 0: correlated, either positively or negatively, so it needs further tests

E_{i,j} = \frac{T_i T_j}{\text{Total}}
\quad T_i = \text{total of the } i\text{th row}, \; T_j = \text{total of the } j\text{th column}

E_{1,1} = \frac{750 \cdot 600}{1000} = 450 \qquad E_{1,2} = \frac{750 \cdot 400}{1000} = 300
E_{2,1} = \frac{250 \cdot 600}{1000} = 150 \qquad E_{2,2} = \frac{250 \cdot 400}{1000} = 100
Chi-Square
                   Active Directory   ¬Active Directory   Sum of Rows
Password Reset       400 (450)           350 (300)            750
¬Password Reset      200 (150)            50 (100)            250
Sum of Columns          600                 400               1000

\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}

\chi^2 = \frac{(400 - 450)^2}{450} + \frac{(350 - 300)^2}{300}
       + \frac{(200 - 150)^2}{150} + \frac{(50 - 100)^2}{100} = 55.56

The \chi^2 value shows that Active Directory and Password Reset are negatively
correlated, since the expected value is higher than the observed value.
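A quick sketch reproducing the χ² statistic from the observed and expected counts in the table:

observed = [400, 350, 200, 50]
expected = [450, 300, 150, 100]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 2))  # 55.56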
Apriori Algorithm Pseudo Code
C_k = candidate itemsets of size k
F_k = frequent itemsets of size k

k = 1;
F_k = {frequent items};                       // frequent 1-itemsets
while (F_k is not empty) {
    C_{k+1} = candidates generated from F_k;  // candidate generation
    derive F_{k+1} by counting the candidates in C_{k+1} against the database at σ;
    k = k + 1;
}
return ∪_k F_k                                // union of the F_k found at each level
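A rough, runnable Python counterpart to the pseudocode (not the course's exact implementation; candidate generation uses the simple join-and-prune scheme implied by the downward-closure slide):

from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets (as frozensets) with their absolute support."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: s for c, s in count(items).items() if s >= minsup}  # F1
    result, k = dict(frequent), 1
    while frequent:
        # Candidate generation: join F_k with itself, keep the (k+1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Prune candidates with an infrequent k-subset (downward closure).
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        frequent = {c: s for c, s in count(candidates).items() if s >= minsup}
        result.update(frequent)
        k += 1
    return result

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, sup in sorted(apriori(transactions, minsup=2).items(),
                           key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), sup)

Running it on the four-transaction example of the next slide reproduces F1, F2 and the single frequent 3-itemset {B, C, E} with support 2.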
Apriori Algorithm
ID   Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

Let σ = 2

C1 (1st scan):            F1:
{A}: 2                    {A}: 2
{B}: 3                    {B}: 3
{C}: 3                    {C}: 3
{D}: 1                    {E}: 3
{E}: 3

C2 (2nd scan):            F2:
{A, B}: 1                 {A, C}: 2
{A, C}: 2                 {B, C}: 2
{A, E}: 1                 {B, E}: 3
{B, C}: 2                 {C, E}: 2
{B, E}: 3
{C, E}: 2

C3 (3rd scan):
{B, C, E}: 2

(Pseudo code, as on the previous slide:)
C_k = candidate itemsets of size k
F_k = frequent itemsets of size k
k = 1;
F_k = {frequent items};
while (F_k is not empty) {
    C_{k+1} = candidates generated from F_k;
    derive F_{k+1} by counting the candidates in C_{k+1} against the database at σ;
    k = k + 1;
}
return ∪_k F_k
Transactions Sparse Matrix
ID   Product Names                                           Outlook  SAP  Active Directory  Desktop  Sharepoint  Voicemail
10   Outlook, SAP, Active Directory                             1      0          1             0          0          0
20   Outlook, Desktop, Active Directory                         1      0          1             1          0          0
30   Outlook, Active Directory, Sharepoint                      1      0          1             0          1          0
40   SAP, Sharepoint, Voicemail                                 0      1          0             0          1          1
50   SAP, Desktop, Active Directory, Sharepoint, Voicemail      0      1          1             1          1          1
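One way to build such a 0/1 matrix from the transactional format is sketched below using pandas (assuming the library is installed; column order may differ from the slide):

import pandas as pd

transactions = {
    10: ["Outlook", "SAP", "Active Directory"],
    20: ["Outlook", "Desktop", "Active Directory"],
    30: ["Outlook", "Active Directory", "Sharepoint"],
    40: ["SAP", "Sharepoint", "Voicemail"],
    50: ["SAP", "Desktop", "Active Directory", "Sharepoint", "Voicemail"],
}

# One row per transaction, one 0/1 column per product.
matrix = pd.DataFrame(
    [{item: 1 for item in items} for items in transactions.values()],
    index=transactions.keys(),
).fillna(0).astype(int)
print(matrix)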

Editor's Notes

  • #3 Discuss Classical statistics (frequentist or Bayesian) is a sub-topic within mathematics. Machine learning is the science of getting computers to act without being explicitly programmed; it is also the study of algorithms that can extract information automatically. Data mining is applied machine learning: building models in order to detect the patterns that allow us to classify or predict situations given an amount of facts or factors. Machine learning is, in turn, a branch of artificial intelligence. Logistic regression is a type of classification machine learning algorithm. Question What is machine learning? What are the differences between statistics, machine learning, artificial intelligence and data mining?
  • #4 Discuss Logistic regression is a type of classification machine learning algorithm. Question
  • #5 Discuss Question
  • #6 Discuss Question
  • #7 Discuss Ci is the fraction of elements in a subset Question
  • #8 Discuss Ci is the fraction of elements in a subset Question
  • #9 Discuss Find the hidden mickey. Question
  • #10 Discuss Theta is the entropy Question
  • #11 Discuss Ci is the fraction of elements in a subset Question
  • #12 Discuss Question
  • #13 Discuss T is the prior subset T,a is the succeeding subset Question
  • #14 Discuss Decision trees are very easy to interpret and can be displayed visually. CART is very greedy: if one subset is much larger in proportion than the others, the algorithm will choose the larger set. Simple decision trees also favor attributes with more categorical levels. Decision trees can overfit, especially when grown very deep; this is mitigated by ensembles. Question
  • #15 Discuss Question
  • #17 Discuss Darker texts are covered in this lecture.
  • #19 Discuss Natural language processing covered in this lecture is English only. There are different types of analyses for other languages like Chinese. Computer does not have common sense compared to human knowledge. Ambiguity is the major concern of natural language processing using computers. Explain n-gram model. Focus Natural language processing for text retrieval should be shallow. Bag of words representation tends to be sufficient; this lecture only focuses on unigram model. Additional Reading Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press. Cambridge, MA: May 1999.
  • #20 Discuss The document example is just one of the many collections of documents which comprises of the database. Give an introduction on information retrieval ranking.
  • #21 Discuss q.d is the dot product of query and document. q.d is also the distinct query words matched in the document. – bit vector Bag of words means using an n-gram model and transforming it into a matrix.
  • #22 Discuss q.d is the dot product using bit-vector of query and document – binary model. Question Which document has the highest ranking based on bit-vector TF?
  • #23 Discuss Term frequency uses the sum of count of TF based on query and document. Question Which document has the highest ranking based on raw TF?
  • #24 Discuss Improving the vector result by assigning weights to the unigram model. There are two easy solutions: TF weighting and IDF weighting. Question What are the limitations for raw term frequency?
  • #25 Discuss Term frequency uses the sum of count of TF based on query and document. Compare d1 and d2 using TF against TF weighting. TF weighting requires manual input for the weight. Question What are the limitations for TF weighting?
  • #27 Discuss The lower-right table is the TF-IDF weighting per document. Question Why is TF-IDF superior to other weighting calculations? When would you use TF-IDF? Why can’t just use the raw term frequency?
  • #28 Discuss There are more advanced ranking functions, Okapi BM25, PL2, etc. Question How will you select which weighting would you use?
  • #29 Discuss Example is a Snowball and Terrier stop words list. Anyone can create a custom stop words list.
  • #30 Discuss For grammatical reasons, the base form of a word changes depending on its use, e.g. by tense and plural form. Stemming uses a crude heuristic approach that removes derivational affixes, usually suffixes, which makes an n-gram model more concrete.
  • #32 Discuss Intrinsic - belonging naturally; essential. Spatio-temporal – data that involves space and time
  • #33 Question Ask for any motivation examples of using pattern discovery.
  • #34 Discuss 𝜎 is sigma. 3-itemsets is valid if relative minimum support threshold is lower. Emphasis on support as the probability in the Association Rule slide for Relative Support. Question Are there any frequent 2-itemsets? Are there any frequent 3-itemsets?
  • #35 Discuss Support(X u Y) means that the support of the union of the items X and Y. This is somewhat confusing since we normally think in terms of probabilities of events and not sets of items. Question Ask if anyone is not familiar with conditional probability.
  • #36 Discuss Closed and max pattern mining solves this problem but it is different from frequent pattern mining. Pruning is one of the solutions to remove duplicate itemsets based on the rules. This will be discussed in the next slide. Question What are the differences between the two – frequent itemsets and association rule? How would you know which of the two rules are more efficient? Outlook -> Active Directory or Active Directory -> Outlook?
  • #37 Discuss Scenario, explain each bullet. There are hidden relationships among these frequent patterns from the scenario. Question Why do we even have to consider itemset subset S if the parent S is infrequent?
  • #38 Discuss Active Directory ⇒ Password Reset, 𝑠,𝑐 = 40%, 67% - higher support and confidence – good. Support = 400/1000 = 40% Confidence = 400/600 = 67% Not Active Directory ⇒ Password Reset, 𝑠,𝑐 = 35%, 88% - also high support and confidence - misleading. Support = 350/1000 = 35% Confidence = 350/400 = 88% Question Why can’t we just use support? Or support and confidence? Is there an interesting pattern with example #2?
  • #39 Discuss Lift(Active Directory, Password Reset) = negatively correlated Lift(Not Active Directory, Password Reset) = positively correlated Question What does it mean when it is positively correlated? Negatively correlated?
  • #40 Discuss Explain how the expected values are calculated. Question Ask for expected values for 𝐸 1.2 𝐸 2.1 𝐸 2.2 .
  • #41 Discuss After calculating the chi-square, look at the difference between the observed and expected values for each cell. Support, confidence, lift and chi-square are good measures, but there are better measures, especially when dealing with very large transactions and null transactions – AllConf, Jaccard, Cosine, Kulczynski, MaxConf. These measures are also applied when clustering patterns based on their geometric distance, density and hierarchy. Question Ask which cells are positively and negatively correlated based on the order of magnitude of the observed and expected values.
  • #42 Discuss The illustrated example of this pseudo-code is on the next slide. There are different types of algorithms, ECLAT, FP-growth (which RapidMiner uses) in finding frequent patterns. Also different algorithms apply for finding different kinds of patterns – closed based, maximum, colossal, null invariant, and graphs.
  • #43 Discuss The pseudo code in wikipedia is more complicated. Also, there are different types of pruning when mining multi-level multi-dimensional. Finding negatively correlated associations and redundancy-aware patterns.
  • #44 Discuss Convert the transactional database into a matrix for easier processing. Relate the transactions matrix with the bit-vector document term matrix. Preparing the transaction matrix can be used for topic modelling of that can be generated on an n-gram model – bigram or trigram models. Question Why is matrix preferable than a transaction format?