Customer Clustering for
Retailer Marketing
A worked example of machine learning in industry,
with reference to useful R packages
Talk for Dublin R User Group - 06 Nov 2013
Jonathan Sedar - Consulting Data Scientist
@jonsedar
Who’s shopping at my stores and how
can I market to them?
Let's try to group them by similar shopping behaviours
Overview

Intro: What's the problem we're trying to solve?

Sourcing, Cleaning & Exploration: What does the data let us do?
read.table {utils}
ggplot {ggplot2}
lubridate {lubridate}

Feature Creation: Extract additional information to enrich the set
data.table {data.table}
cut2 {Hmisc}
dcast {reshape2}

Feature Selection: Reduce to a smaller dataset to speed up computation
scale {base}
prcomp {stats}

Mixture Modelling: Finding similar customers without prior information
… and interpreting the results
mclust {mclust}
Intro

Customer profiling enables targeted marketing and can improve operations:
Retention offers
Product promotions
Loyalty rewards
Optimise stock levels & store layout

[Slide graphic: a profile card for Bob (Dublin 6, Age 42, Married?) before and after profiling, which adds Type: "Family First"]
Intro

We want to turn transactional data into customer classifications

A real dataset:
32,000 customers
24,000 items
800,000 transactions
… over a 4-month period

[transactions] -> Magic -> [classifications]

Many, many ways to approach the problem!
Intro

Practical machine learning projects tend to have a similar structure:

Source your raw data -> Cleaning & importing -> Exploration & visualisation ->
Feature creation -> Feature selection -> Model creation -> Model optimisation & interpretation
Sourcing, Cleansing, Exploration

So what information do we have?
Sourcing, Cleansing, Exploration

What information do we have?
“Ta-Feng” grocery shopping dataset
800,000 transactions
32,000 customer ids
24,000 product ids
4-month period over Winter 2000-2001
http://recsyswiki.com/wiki/Grocery_shopping_datasets

   trans_date   cust_id  age  res_area     product_id  quantity  price
1: 2000-11-01  00046855    D         E  4710085120468         3     57
2: 2000-11-01  00539166    E         E  4714981010038         2     48
3: 2000-11-01  00663373    F         E  4710265847666         1    135
...
Sourcing, Cleansing, Exploration

Data definition and audit (1 of 2)
A README file, excellent...
4 ASCII text files:
# D11: Transaction data collected in November, 2000
# D12: Transaction data collected in December, 2000
# D01: Transaction data collected in January, 2001
# D02: Transaction data collected in February, 2001

Curious choice of delimiter and an extended charset
# First line: Column definition in Traditional Chinese
#

§È¥¡;∑|≠˚•d∏π;¶~ƒ÷;∞œ∞Ï;∞”´~§¿√˛;∞”´~ΩsΩX;º∆∂q;¶®•ª;æP∞‚

# Second line and the rest: data columns separated by ";"

Pre-clean in shell: strip the ':' characters from the first (timestamp) field. gsub returns the number of substitutions made, so awk prints only the lines where it fired:
awk -F";" 'gsub(":","",$1)' D02
Sourcing, Cleansing, Exploration

Data definition and audit (2 of 2)
Although prepared by another researcher, can still find undocumented
gotchas:
# 1: Transaction date and time (time invalid and useless)
# 2: Customer ID
# 3: Age: 10 possible values,
#    A <25, B 25-29, C 30-34, D 35-39, E 40-44, F 45-49, G 50-54, H 55-59, I 60-64, J >65
#    actually there's 22362 rows with value K, will assume it's Unknown
# 4: Residence Area: 8 possible values,
#    A-F: zipcode area: 105,106,110,114,115,221, G: others, H: Unknown
#    Distance to store, from the closest: 115,221,114,105,106,110
#    so we'll factor this with levels "E","F","D","A","B","C","G","H"
# 5: Product subclass
# 6: Product ID
# 7: Amount
# 8: Asset: not explained, low values, not an id, will ignore
Sourcing, Cleansing, Exploration

Import & preprocess (read.table)
Read each file into a data.table, whilst applying basic data types
> dtnov <- data.table(read.table(fqn,col.names=cl$names,
colClasses=cl$types
,encoding="UTF-8",stringsAsFactors=F));

Alternatives include RODBC / RPostgreSQL
> con <- dbConnect(dbDriver("PostgreSQL"), host="localhost", port=5432
,dbname="tafeng", user="jon", password="")
> dtnov <- dbGetQuery(con,"select * from NovTransactions")
Sourcing, Cleansing, Exploration

Import & preprocess (lubridate)
Convert some datatypes to be more useful:
String -> POSIXct datetime (UNIX time UTC) using lubridate
> dtraw[,trans_date:= ymd(trans_date )]
> cat(ymd("2013-11-05"))
1383609600

… also, applying factor levels to the residence area
> dtraw[,res_area:= factor(res_area
,levels=c("E","F","D","A","B","C","G","H") )]
Sourcing, Cleansing, Exploration

Explore: Group By (data.table)
How many transactions, dates, customers, products and product subclasses?
> nrow(dt[,1,by=cust_id])    # 32,266

Using data.table's dt[i,j,k] structure, where:
i subselects rows             (SQL WHERE)
j selects / creates columns   (SQL SELECT)
k groups by columns           (SQL GROUP BY)

e.g. the above is:
select count(*)
from dt
group by cust_id
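
The same pattern answers the rest of the question. A minimal sketch (assuming the post-cleaning column names trans_date, prod_id and prod_cat used elsewhere in this deck):

> dt[,list(ndate=length(unique(trans_date))   # unique transaction dates
,ncust=length(unique(cust_id))                # unique customers
,nprod=length(unique(prod_id))                # unique products
,ncat =length(unique(prod_cat)))]             # unique product subclasses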
Sourcing, Cleansing, Exploration

Example of data logic-level cleaning
Product hierarchies: we assumed many product_ids to one product_category,
but actually … a handful of product_ids belong to 2 or 3 product_cats:
> transcatid <- dt[,list(nbask=length(trans_id)),by=list(prod_cat,
prod_id)]
> transid <- transcatid[,list(ncat=length(prod_cat),nbask=sum(nbask))
,by=prod_id]
> transid[,length(prod_id),by=ncat]
   ncat    V1
1:    1 23557
2:    2   253
3:    3     2
Solution: dedupe; keep the prod_id-prod_cat combos with the largest nbask
> ids <- transid[ncat>1,prod_id]
> transcatid[prod_id %in% ids,rank :=rank(-nbask),by=prod_id]
> goodprodcat <- transcatid[is.na(rank) | rank ==1,list(prod_cat,
prod_id)]
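
Applying this mapping back to the transaction table isn't shown on the slide; a keyed join along these lines would do it (a sketch):

> setkey(dt, prod_cat, prod_id)
> setkey(goodprodcat, prod_cat, prod_id)
> dt <- dt[goodprodcat]   # keep only the winning prod_cat per prod_id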
Sourcing, Cleansing, Exploration

Explore: Visualise (ggplot) (1 of 4)
e.g. transactions by date
p1 <- ggplot(dt[,list(num_trans=length(trans_id)),by=trans_date]) +
geom_bar(aes(x=trans_date,y=num_trans),stat='identity',alpha=0.8)
plot(p1)
Sourcing, Cleansing, Exploration

Explore: Visualise (ggplot) (1.5 of 4)
e.g. transactions by date (alternate plotting)
p1b <- ggplot(dt[,list(num_trans=length(trans_id)),by=trans_date]) +
geom_point(aes(x=trans_date,y=num_trans),stat='identity',alpha=0.8) +
geom_smooth(aes(x=trans_date,y=num_trans),method='loess',alpha=0.8)
plot(p1b)
Sourcing, Cleansing, Exploration

Explore: Visualise (ggplot) (2 of 4)
e.g. histogram count of customers with N items bought
p2 <- ggplot(dt[,list(numitem=length(trans_id)),by=cust_id]) +
geom_bar(aes(x=numitem),stat='bin',binwidth=10,alpha=0.8,fill="orange") +
coord_cartesian(xlim=c(0,200))
plot(p2)
Sourcing, Cleansing, Exploration

Explore: Visualise (ggplot) (3 of 4)
e.g. scatterplot of total items vs total baskets per customer
# dttt: a per-customer summary with columns numbask and numitem (built earlier, not shown)
p4a <- ggplot(dttt) +
geom_point(aes(x=numbask,y=numitem),size=1,alpha=0.8) +
geom_smooth(aes(x=numbask,y=numitem),method="lm")
plot(p4a)
Sourcing, Cleansing, Exploration

Explore: Visualise (ggplot) (4 of 4)
e.g. scatterplot of total items vs total baskets per customer per res_area
p5 <- ggplot(dttt) +
geom_point(aes(x=numbask,y=numitem,color=res_area),size=1,alpha=0.8) +
geom_smooth(aes(x=numbask,y=numitem),method="lm",color=colorBlind[1]) +  # colorBlind: a palette vector defined earlier (not shown)
facet_wrap(~res_area)
plot(p5)

A-F: zipcode area: 105,106,110,114,115,221
G: others
H: Unknown

Dist to store, from closest:
E < F < D < A < B < C
Feature Creation

Can we find or extract more info?
Feature Creation

Create New Features
Per customer (32,000 of them):

Counts:
# total baskets (== unique days)
# total items
# total spend
# unique prod_subclass, unique prod_id

Distributions (min - med - max will do):
# items per basket
# spend per basket
# product_ids, prod_cats per basket
# duration between visits

Product preferences:
# prop. of baskets in the N bands of product cats & ids by item pop.
# prop. of baskets in the N bands of product ids by item price
Feature Creation

Counts
Pretty straightforward use of group by with data.table

> counts <- dt[,list(nbask=length(trans_id)
,nitem=sum(quantity)
,spend=sum(quantity*price))
,by=list(cust_id)]
> setkey(counts,cust_id)
> counts
    cust_id  nbask  nitem  spend
1: 00046855      1      3    171
2: 00539166      4      8    300
3: 00663373      1      1    135
Feature Creation

Distributions
Again making use of group-by with data.table, using list to form a new data.table

> dists_ispb <- dt[,list(nitem=sum(quantity)
,spend=sum(quantity*price))
,by=list(cust_id,trans_date)]
> dists_ispb <- dists_ispb[,list(ipb_max=max(nitem)
,ipb_med=median(nitem)
,ipb_min=min(nitem)
,spb_max=max(spend)
,spb_med=median(spend)
,spb_min=min(spend))
,by=cust_id]
> setkey(dists_ispb,cust_id)
Feature Creation

Example considerations: is it acceptable to lose datapoints?
Feature: duration between visits
If customers visited once only, they have value NA - an issue for Mclust
Solutions:
A: remove them from modelling? wasteful in this case (we'd lose 30%!)
   But maybe we don't care about classifying one-time shoppers
B: or give them all the same value
   But which value? all == 0 isn't quite true, and a mass at any one value will skew the clustering
C: impute values based on the global mean and SD of each column
   Usually a reasonable fix, except for ratio columns, where it's clumsy and likely misleading, requiring mirroring to get +ve axes
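
For reference, the duration feature itself could be built along these lines; a minimal sketch (assuming dt as above), where single-visit customers naturally come out NA, forcing the A/B/C choice:

> visits <- unique(dt[,list(cust_id,trans_date)])   # one row per customer visit day
> setkey(visits, cust_id, trans_date)
> durs <- visits[,{d <- as.numeric(diff(trans_date), units="days")
                   list(dur_med = if(length(d)) median(d) else NA_real_)}
                 ,by=cust_id]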
Feature Creation

Product Preferences (1 of 2) (Hmisc)
Trickier, since we don't have a product hierarchy,
e.g. Food > Bakery > Bread > Sliced White > Brennans
But we do have price per unit, and an inherent measure of popularity, in the transaction log, e.g.
> priceid <- dt[,list(aveprice=median(price)),by=prod_id]
> priceid[,prodid_pricerank:=LETTERS[as.numeric(cut2(aveprice,g=5))]]
# A low, E high
> priceid
         prod_id  aveprice  prodid_pricerank
1: 4710085120468        21                 A
2: 4714981010038        26                 A
3: 4710265847666       185                 D
Feature Creation

Product Preferences (2 of 2) (dcast)
Now:
1. Merge the product price class back onto each transaction row
2. Reformat and sum transaction count in each class per customer id, e.g.
> dtpop_prodprice <- data.table(dcast(dtpop
,cust_id~prodid_pricerank
,value.var="trans_id"))
> dtpop_prodprice
    cust_id  A  B  C  D  E
1: 00001069  1  3  3  2  2
2: 00001113  7  1  4  5  1
3: 00001250  6  4  0  2  2

3. And further process to make proportional per row (see the sketch below)
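
Steps 1 and 3 aren't spelled out on the slide; a hedged sketch of both, assuming the names above (dtpop being the merged transaction table):

> dtpop <- merge(dt, priceid[,list(prod_id,prodid_pricerank)], by="prod_id")
> pricecols <- c("A","B","C","D","E")
> tot <- rowSums(dtpop_prodprice[,pricecols,with=F])
> dtpop_prodprice[,(pricecols) := lapply(.SD, "/", tot), .SDcols=pricecols]   # counts -> row proportions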
Feature Selection

Too many features?
We have a data.table of 20,000 customers x 40 synthetic features
… which we hope represent their behaviour sufficiently to distinguish them
More features == heavier processing for clustering
Can we lighten the load?
Feature Selection

Principal Components Analysis (PCA)
Standard method for reducing dimensionality, maps each datapoint to a new
coordinate system created from principal components (PCs).
PCs are ordered:
- the 1st PC is aligned to the maximum variance across all features
- the 2nd PC is aligned to the maximum remaining variance, orthogonal to the 1st
- ...etc.
Where each datapoint had N features,
it now has N PC values which are a
composite of the original features.
We can now feed the first few PCs to
the clustering and keep the majority of
the variance.
Feature Selection

PCA (scale & prcomp) (1 of 2)
Scale first, so we can remove extreme outliers in original features
> cstZi_sc <- scale(cst[,which(!colnames(cst) %in% c("cust_id")),with=F])
> cstZall <- data.table(cst[,list(cust_id)],cstZi_sc)

Now all features are in units of 1 s.d.
For each row, if any one feature has value > 6 s.d., record in a filter vector
> sel <- apply(cstZall[,colnames(cstZi_sc),with=F]   # the scaled feature columns
,1
,function(x){max(abs(x)) > 6})
> cstZoutliers <- cstZall[sel]
> nrow(cstZoutliers)    # 830 (/20381 == 4% loss)

Health warning: we’ve moved the centre, but prcomp will re-center for us
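
The next slide works with cstZ, which presumably is cstZall with the flagged rows dropped (an assumption, not shown on the slide):

> cstZ <- cstZall[!sel]   # drop the 830 flagged outlier rows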
Feature Selection

PCA (scale & prcomp) (2 of 2)
And now run prcomp to generate PCs
> cstPCAi <- prcomp(cstZ[,which(!colnames(cstZ) %in% c("cust_id")),
with=F])
> cstPCAi$sdev       # sqrt of eigenvalues
> cstPCAi$rotation   # loadings
> cstPCAi$x          # PCs (aka scores)

> summary(cstPCAi)
Importance of components:
                          PC1    PC2    PC3     PC4  … etc
Standard deviation     2.1916 1.8488 1.7923 1.37567
Proportion of Variance 0.1746 0.1243 0.1168 0.06882
Cumulative Proportion  0.1746 0.2989 0.4158 0.48457

Wait, prcomp vs princomp?
princomp (eigendecomposition of the covariance matrix) can be faster but is potentially less accurate than prcomp (SVD). Performance of prcomp is very acceptable on this small dataset (20,000 x 40)
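
How many PCs to keep? One option (not shown on the slide) is a cumulative-variance cutoff; a sketch:

> imp <- summary(cstPCAi)$importance
> npc <- which(imp["Cumulative Proportion",] >= 0.80)[1]   # e.g. keep 80% of variance
> cstPCs <- data.table(cstZ[,list(cust_id)], cstPCAi$x[,1:npc])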
Feature Selection

PCA quick visualisation (3 of 2)
Clustering

Finally! Let's do some clustering
Clustering

Finite Mixture Modelling
● Assume each datapoint is drawn from a mixture of classes, each explained by a different model
● Pick a number of models and fit to the data; the best fit wins
Clustering

Gaussian Mixture Modelling (GMM)
● Models have a Gaussian dist.; we can vary the params
● Place N models at random points, then move and fit them to the data using the Expectation Maximisation (EM) algorithm
● EM is an iterative method for finding a local maximum likelihood estimate
● Slow but effective
● GMM's advantage over e.g. k-means is the ability to vary the model params for a better fit

http://ai.stanford.edu/~chuongdo/papers/em_tutorial.pdf
Clustering

Of course there’s an R package (mclust)
Mclust v4 provides:
● Clustering, classification, density estimation
● Auto parameter estimation
● Excellent default plotting to aid live investigation

On CRAN, with detail at http://www.stat.washington.edu/mclust/
Clustering

Finding the optimal # models (mclust)
Will automatically iterate over a number of models (components) and covariance params

Will use the combination with the best fit (highest BIC): here C5, VVV (5 components, fully variable covariance)
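
In code that whole search is a single call; a minimal sketch, assuming the kept PCs sit in cstPCs as built above:

> library(mclust)
> fit <- Mclust(cstPCs[,!"cust_id",with=F], G=1:9)   # iterates over G and covariance models
> summary(fit)                                       # here: best BIC at 5 components, VVV
> plot(fit, what="BIC")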
Clustering

Interpreting the model fit (1 of 3) (mclust)

The classification pairs-plot lets us view the clustering by principal component (PC1 … PC5)
Clustering

Interpretation (2 of 3) (mclust)
'Read' the distributions w.r.t. PC components

PC1: "Variety axis"
Distinct products per basket and raw count of distinct products overall
  prodctpb_max   0.85
  prodctpb_med   0.81
  ipb_med        0.77
  ipb_max        0.77
  nprodcat       0.75

PC2: "Spendy axis"
Prop. of baskets containing expensive items, and simply raw count of items and visits
  popcat_nbaskE  -0.71
  popid_nbaskE   -0.69
  popcat_nbaskD   0.60
  nbask          -0.51
  nitem          -0.51
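
Those loading tables can be read straight off the rotation matrix; for example (a sketch):

> lds <- cstPCAi$rotation[,"PC2"]
> round(lds[order(abs(lds), decreasing=T)][1:5], 2)   # top five loadings by magnitude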
Clustering

Interpretation (3 of 3) (mclust)

'Read' the distributions w.r.t. PC components

Bob: Spendier, higher variety, family oriented?
(PC1: Greater Variety)

Charles: Thriftier, reduced selection, shopping to a budget?
(PC2: Reduced selection of expensive items, fewer items)
We covered...
Intro: What's the problem we're trying to solve?

Sourcing, Cleaning & Exploration: What does the data let us do?
read.table {utils}
ggplot {ggplot2}
lubridate {lubridate}

Feature Creation: Extract additional information to enrich the set
data.table {data.table}
cut2 {Hmisc}
dcast {reshape2}

Feature Selection: Reduce to a smaller dataset to speed up computation
scale {base}
prcomp {stats}

Mixture Modelling: Finding similar customers without prior information
… and interpreting the results
mclust {mclust}
