2. About me
• Education
• NCU (MIS), NCCU (CS)
• Work Experience
• Telecom big data Innovation
• AI projects
• Retail marketing technology
• User Group
• TW Spark User Group
• TW Hadoop User Group
• Taiwan Data Engineer Association Director
• Research
• Big Data/ ML/ AIOT/ AI Columnist
4. Supervised learning vs. Unsupervised learning
• Supervised learning: Discover patterns in the data that relate data
attributes with a target (class) attribute.
• These patterns are then utilized to predict the values of the target attribute in
future data instances.
• Unsupervised learning: The data have no target attribute.
• We want to explore the data to find some intrinsic structures in them.
• Classic unsupervised learning algorithm
• Clustering algorithms (Inductive/ Transductive learning)
• Association rules (also called Market Basket Analysis)
10. Transform data before K-means
• Many statistical tests make the assumption that datasets are normally
distributed.
• However, this is often NOT the case in practice.
• Transformations:
• Log Transformation: Transform the response variable from y to log(y).
• Square Root Transformation: Transform the response variable from y to y^(1/2).
• Cube Root Transformation: Transform the response variable from y to y^(1/3).
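The three transformations above can be sketched in a few lines; the sample values below are made up to illustrate a right-skewed variable:

```python
import numpy as np

# Hypothetical right-skewed response variable (e.g. purchase amounts)
y = np.array([1.0, 2.0, 4.0, 8.0, 100.0])

log_y = np.log(y)    # log transform:  y -> log(y)
sqrt_y = np.sqrt(y)  # square root:    y -> y^(1/2)
cbrt_y = np.cbrt(y)  # cube root:      y -> y^(1/3)
```

Each transform compresses large values more than small ones, pulling a long right tail toward the bulk of the data.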
16. Standardize Data
• Standardization (Z-scores) rescales a
dataset to have a mean of 0 and a
standard deviation of 1.
• We typically standardize data when we’d
like to know how many standard
deviations each value in a dataset lies
from the mean.
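A minimal sketch of standardization, using a small made-up sample:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Z-score: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

# z now has mean 0 and standard deviation 1; each entry is the
# number of standard deviations the original value lies from the mean
```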
17. Normalize Data
• Normalization rescales a dataset so that
each value falls between 0 and 1.
• Typically we normalize data when
performing some type of analysis in
which we have multiple variables that
are measured on different scales and we
want each of the variables to have the
same range.
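Min-max normalization can be sketched the same way; the sample values are made up:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Rescale so the minimum maps to 0 and the maximum maps to 1
x_norm = (x - x.min()) / (x.max() - x.min())
```

After this step, variables measured on very different scales (e.g. age vs. income) all share the range [0, 1], so no single variable dominates a distance-based method such as K-means.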
18. Quiz 2:
• When conducting K-means, how should categorical variables be handled?
• When conducting K-means on numerical variables with severely skewed
distributions, how should they be handled?
• If we have segmented the data into groups using rescaled values, how
should new data be assigned to groups? (Using K-means)
• Answer 1: Union the new data with the old data and rebuild the model.
• Answer 2: ??
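One possible approach to the new-data question (a minimal sketch, with made-up training data and centroids): store the scaling parameters learned on the training data, apply the same rescaling to each new observation, and assign it to the nearest existing centroid, without retraining:

```python
import numpy as np

# Scaling parameters learned from the ORIGINAL training data
train = np.array([[1.0, 100.0], [2.0, 200.0], [9.0, 900.0], [10.0, 1000.0]])
mins, maxs = train.min(axis=0), train.max(axis=0)

def rescale(x):
    # Apply the SAME min-max parameters used when the model was built
    return (x - mins) / (maxs - mins)

# Centroids found by K-means on the rescaled training data (illustrative values)
centroids = np.array([[0.1, 0.1], [0.9, 0.9]])

def assign(new_point):
    # Rescale the new observation, then pick the nearest centroid
    z = rescale(new_point)
    dists = np.linalg.norm(centroids - z, axis=1)
    return int(np.argmin(dists))
```

The key point is that new data must pass through the same rescaling as the training data; otherwise the distances to the centroids are meaningless.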
20. Example of Cluster Analysis
• Retail Marketing
• The company can then send personalized advertisements or sales letters to
each household based on how likely they are to respond to specific types
of advertisements.
21. Example of Cluster Analysis
• Streaming Services
• Using these metrics, a streaming service can perform cluster analysis
to identify high usage and low usage users so that they can know who
they should spend most of their advertising dollars on.
22. Example of Cluster Analysis
• Sports Science
• They can then feed these variables into a clustering algorithm to
identify players that are similar to each other so that they can have
these players practice with each other and perform specific drills
based on their strengths and weaknesses.
23. Example of Cluster Analysis
• Email Marketing
• Using these metrics, a business can perform
cluster analysis to identify consumers who use
email in similar ways and tailor the types of
emails and frequency of emails they send to
different clusters of customers.
https://email.uplers.com/blog/email-segmentation-recipe-great-email-marketing/
24. Example of Cluster Analysis
• Health Insurance
• An actuary can then feed these variables into a clustering algorithm
to identify households that are similar. The health insurance company
can then set monthly premiums based on how often they expect
households in specific clusters to use their insurance.
26. Association Rules
• In a transaction database with a large amount of data, look
for correlations among items.
• The classic story of Walmart diapers and beer.
• Selling these two seemingly unrelated products together can actually
increase sales.
In general, such correlations cannot be found by direct observation; they must be discovered algorithmically.
27. Association Rules
• Two steps:
• First, find the frequent item sets.
• Collections of items that often appear together.
• Found with the Apriori algorithm.
• Second, generate association rules from the frequent item sets.
• Frequent item sets may reveal strong correlations.
• Rules must satisfy thresholds such as minimum support and minimum confidence.
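The two steps can be sketched on a toy transaction database. This brute-force enumeration is NOT the real Apriori algorithm (which prunes candidate sets level by level), but it shows the same two-step structure with assumed thresholds:

```python
from itertools import combinations

# Toy transaction database (hypothetical items)
transactions = [
    {"B", "C", "E"},
    {"B", "E"},
    {"A", "B", "E"},
    {"A", "C"},
]
min_support = 0.5      # itemset must appear in >= 50% of transactions
min_confidence = 0.7   # rule must be true >= 70% of the time

def support(itemset):
    # Fraction of transactions that contain every item in itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1: enumerate frequent item sets
items = sorted(set().union(*transactions))
frequent = [
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if support(set(c)) >= min_support
]

# Step 2: generate rules X -> Y from each frequent item set
rules = []
for itemset in frequent:
    if len(itemset) < 2:
        continue
    for k in range(1, len(itemset)):
        for x in combinations(itemset, k):
            x = frozenset(x)
            y = itemset - x
            conf = support(itemset) / support(x)
            if conf >= min_confidence:
                rules.append((set(x), set(y), conf))
```

In this toy database {B, E} is frequent, so the rules B → E and E → B survive the confidence threshold.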
28. Association Rules
• From the sales database, we found that the items {B, C, E} are highly
correlated. Such a set is called a frequent item set.
• The tendency of {B, E} to be purchased together is called the
strength of association.
• To measure how strong an association is, we estimate its Support and
Confidence.
29. Association Rules
• Support
• If the transaction database has 200 records in total and the item Sausage
appears in 50 of them, then its support is 50/200 = 1/4; that is, the
support of Sausage is 25%.
• Confidence range: [0, 1].
• Indicates the conditional probability of one item appearing given
another. Simply put, it is the probability of item B appearing
when item A has already been purchased.
Confidence(A → B) = P(B | A) = P(A ∩ B) / P(A)
• P(B | A): the probability that B occurs given
that A has occurred
• P(A ∩ B), also written P(A, B) or P(AB): the
probability that the two events occur together
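A worked numeric example of these definitions, using hypothetical counts:

```python
total = 200      # total number of transactions
count_a = 50     # transactions containing item A
count_ab = 40    # transactions containing both A and B

support_a = count_a / total              # P(A) = 0.25
support_ab = count_ab / total            # P(A ∩ B) = 0.20
confidence_a_b = support_ab / support_a  # P(B | A) = 0.80
```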
30. Association Rules
• Minimum support and minimum confidence:
• For example, with minimum support set to 50%, the purchased item set
{A, B} must appear in at least 50% of all transactions before it is
considered a frequent item set.
If the support/confidence thresholds are set too low, too many
association rules will appear in the results.
If they are set too high, too few rules will be found,
which makes it hard to base decisions on the association
results.
31. Association Rules
• Outputs:
• Many rules are generated; we sort them by support or
confidence to find the ones we are interested in.
39. Metrics Description
• Support: how often a rule applies to the data set (rule / data)
• Confidence: how frequently the items in Y appear in transactions containing X, i.e., how frequently the rule is true (support of rule / support of antecedent)
• Coverage: how often the antecedent appears in the data set (support of antecedent / data)
• Strength: support of consequent / support of antecedent
• Lift: how frequently the rule is true per consequent item (data × confidence / support of consequent)
• Leverage: the difference between the two items appearing together and the two items appearing independently (support × data − antecedent support × consequent support / data²)
42. Association Rules analysis
A logical step would be to place Wine closer to the (Nuts, Aspirin, Pancakes) section.
The rule holds when reading from the antecedent on the left to the consequent on the right, but NOT in reverse!
44. Homework
• Modify the file format (20231121_hw.csv) to a format compatible
with Orange 3 Association Rules.
• What interesting association rules have you
discovered?
The first row lists all item names. For each
transaction, mark every purchased item with the
value 1; otherwise mark it as ? (not 0).
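A sketch of that conversion, assuming the raw homework file lists one basket per row (the item names and file name here are hypothetical, not taken from 20231121_hw.csv):

```python
import csv

# Hypothetical raw transactions, one basket per row
transactions = [
    ["milk", "bread"],
    ["bread", "beer"],
    ["milk", "beer", "diapers"],
]

# First row: every distinct item name
items = sorted({i for t in transactions for i in t})

with open("basket.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(items)
    for t in transactions:
        # Mark purchased items with 1, absent items with "?" (not 0)
        writer.writerow([1 if i in t else "?" for i in items])
```

Using "?" (Orange's missing-value marker) rather than 0 keeps non-purchases out of the rule mining, so rules are built only from items that were actually bought together.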