The Barclays Data
Science Hackathon:
Building Retail Recommender
Systems based on Customer
Shopping Behavior
Gianmario Spacagna
@gm_spacagna
Data Science Milan meetup, 13 July 2016
The Barclays Data Science Team
•  Retail Business Banking division based in the HQ
(Canary Wharf, London)
•  At the time (Dec 2015) the team had 6 members:
head + a mix of engineering and machine learning specialists
•  Goal: building data-driven applications such as:
–  Insights Engine for small businesses
–  Complaints NLP analytics
–  Mortgage predictive models
–  Pricing optimisation
–  Graph fraud detection
–  and so on...
Lanzarote off-site
•  1-week contest (5 days,
Monday to Friday)
•  Building a recommender
system of retail merchants for
people living in Bristol, UK
•  Forget about 9-5 working
hours
•  Stimulate creativity and
teamwork
•  Brainstorm new ideas and
make them happen
•  Have fun!
The technical challenges
•  No infrastructure available, only laptops and a
1G WiFi shared Internet connection.
•  Build, test, and refactor quickly,
no time for long end-to-end evaluations.
•  Work with common structures without
constraining individual initiative and innovation.
•  Design for deployment to production on a multi-
tenant cluster.
Code till 3am, wake up early in the morning and go surfing!
Enjoy Canarian cuisine…
…and local wine
The Professional Data Science Manifesto
work in progress…
Why Spark? (just to name a few…)
•  Speed / performance, in-memory solution
•  Elastic jobs, you can start small and scale up
•  What works locally works distributed, almost!
•  Single place for doing everything from source to the
endpoint
•  It cuts development time, being designed according to
functional programming principles
•  Reproducibility via a DAG of declarative transformations
rather than procedural side-effect actions
Preparation work (ETL)
•  Extract, transform and load data into representations
matching the business domain rather than the raw
database representation
•  Aggregate in order to increase generality while
preserving anonymised information for training the
models
•  Every business is uniquely represented by the
combo (MerchantName, MerchantTown) + optionally
a postcode when available
•  Join each transaction that happened in Bristol with the
business and customer details
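A plain-Scala sketch of this join step. All names and sample values here are illustrative, and Maps stand in for the lookup tables/RDDs used in the real pipeline:

```scala
// Illustrative sketch of the ETL join; names and values are made up.
case class Tx(customerId: String, merchantName: String, merchantTown: String, amount: Double)

// small lookup tables standing in for the business and customer detail sources
val businessPostcodes = Map(("Waitrose", "Bristol") -> "postcode-A")
val customerBuckets   = Map("c1" -> "engineer-male")

val transactions = Seq(
  Tx("c1", "Waitrose", "Bristol", 58.42),
  Tx("c1", "Waitrose", "London", 12.00)   // not Bristol, filtered out
)

// keep Bristol transactions and attach customer bucket + optional postcode
val joined = transactions
  .filter(_.merchantTown == "Bristol")
  .flatMap { t =>
    customerBuckets.get(t.customerId).map { bucket =>
      (t.merchantName, t.merchantTown, bucket,
       businessPostcodes.get((t.merchantName, t.merchantTown)))
    }
  }
```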
Anonymised Generalised Data
•  Bottom-up k-anonymity:
–  Map all of the categorical attributes of each customer
(online active flag, residential area type, gender,
marital status, occupation) into a bucket
–  Group similar customers and replace the single
bucket with a group of buckets and count the number
of group members
–  Recursively continue until each user is mapped into a
bucket group with at least k members
•  Masking:
–  Replace user identifiers with uniquely generated IDs
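A very simplified, hypothetical sketch of the bottom-up merging idea: start with one group per categorical bucket and greedily merge the smallest group into the next smallest until every group has at least k members. The real generalisation logic is more involved; this only illustrates the recursion to k.

```scala
// Simplified bottom-up k-anonymity sketch: each customer ends up mapped to a
// bucket GROUP with at least k members (assuming >= k customers in total).
def kAnonymise(customerBuckets: Map[String, String], k: Int): Map[String, Set[String]] = {
  // (bucket set, member set) pairs, initially one group per bucket
  var groups: Vector[(Set[String], Set[String])] =
    customerBuckets.groupBy(_._2).map { case (bucket, members) =>
      (Set(bucket), members.keySet)
    }.toVector
  // keep merging the smallest group until all groups reach k members
  while (groups.size > 1 && groups.exists(_._2.size < k)) {
    val sorted = groups.sortBy(_._2.size)
    val (small, rest) = (sorted.head, sorted.tail)
    val merged = (rest.head._1 ++ small._1, rest.head._2 ++ small._2)
    groups = merged +: rest.tail
  }
  groups.flatMap { case (buckets, members) => members.map(_ -> buckets) }.toMap
}
```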
K-anonymity example
Original transactions:

timestamp  | customerId | occupation | gender | amount | business
2015-03-05 | 9218324    | Engineer   | male   | 58.42  | Waitrose
2015-03-06 | 324624     | Cook       | female | 118.90 | Waitrose
2015-03-06 | 324624     | Cook       | female | 5.99   | Abokado

After anonymisation:

Categorical bucket                       | Day of week | customerId | amount    | business
engineer-male, student-male, cook-female | Thursday    | 00003      | [50-60]   | Waitrose
engineer-male, student-male, cook-female | Friday      | 00012      | [100-120] | Waitrose
engineer-male, student-male, cook-female | Friday      | 00012      | [0-10]    | Abokado
Data Types
AnonymizedRecord corresponds to a single transaction where:
•  Customer confidential information has been masked and
attributes generalised into a set of possible buckets
•  Business information is in the clear (name, town and optional
postcode)
•  Time is only represented as day of week
•  Amount was binned to reduce resolution
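A hypothetical sketch of what such a record type could look like; the actual field names and types in the hackathon code base may have differed:

```scala
// Illustrative sketch of the AnonymizedRecord data type described above.
case class AnonymizedRecord(
  maskedCustomerId: String,      // uniquely generated ID, no PII
  customerBuckets: Set[String],  // generalised attribute buckets, e.g. "engineer-male"
  dayOfWeek: String,             // time reduced to the day of week
  amountBin: (Int, Int),         // e.g. (50, 60) for an amount in the [50-60] bin
  merchantName: String,
  merchantTown: String,
  postcode: Option[String]       // only when available
)
```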
Some numbers (Bristol only)
•  ~ 70 GB of data
(Kryo serialized format)
•  A few million
transactions from 2015
(1 year's worth of data)
•  ~ 100k Barclays retail
customers
•  ~ 50K Businesses
Recommender APIs
•  RecommenderTrainer receives the raw data and has to
perform the feature engineering tailored for the specific
implementation and return a Recommender model instance.
•  The Recommender instance takes an RDD of customer ids
and a positive number N and returns the top N
recommendations for each customer.
•  We used the pair (MerchantName, MerchantTown) to
represent the unique business we want to recommend.
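The two APIs can be sketched as plain Scala traits. The real code operated on Spark RDDs; `Seq` stands in for `RDD` here so the sketch is self-contained, and the names are illustrative:

```scala
// Sketch of the recommender APIs described above (Seq standing in for RDD).
trait Recommender {
  // top-n recommendations, as (MerchantName, MerchantTown) pairs, per customer id
  def recommend(customerIds: Seq[String], n: Int): Map[String, Seq[(String, String)]]
}

trait RecommenderTrainer[Record] {
  // feature engineering tailored to the implementation, returning a fitted model
  def train(data: Seq[Record]): Recommender
}
```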
Thoughts on Efficient Spark Programming
(Vancouver Spark Meetup 03-09-2015)
http://www.slideshare.net/nielsh1/thoughts-on-efficient-spark-
programming-vancouver-spark-meetup-03092015
•  Split data by customer id, NOT by transaction
•  Down-sample test customers for quick evaluations
•  Train and get recommendations
•  Check the model is not cheating
•  Ground truth for evaluation
•  Compute MAP
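The customer-level split can be sketched as follows (plain collections, illustrative names): hashing the customer id guarantees all of a customer's transactions land on the same side, never split mid-customer.

```scala
import scala.util.hashing.MurmurHash3

// Sketch: deterministic split by customer id, so no customer's transactions
// are ever divided between train and test.
def splitByCustomer[T](txs: Seq[(String, T)], testFraction: Double): (Seq[(String, T)], Seq[(String, T)]) =
  txs.partition { case (customerId, _) =>
    // map the id hash to [0, 1); customers below testFraction go to the test side
    ((MurmurHash3.stringHash(customerId) & Int.MaxValue).toDouble / Int.MaxValue) >= testFraction
  }
```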
Mean Average Precision (MAP)
•  Each customer has visited m relevant businesses
•  Recommendations predict n ranked businesses
•  For a given customer we compute the average precision as:
•  P(k) = precision at cut-off k in the recommendation list, i.e.
the fraction of the first k recommended businesses that are
relevant.
P(k) = 0 when the k-th business is not relevant.

ap@n = ( Σ_{k=1..n} P(k) ) / min(m, n)

•  MAP for N customers at n is the average of the average
precision of each customer:

MAP@n = ( Σ_{i=1..N} ap@n_i ) / N
MAP example
•  Bob visited 3 of his 6 recommended businesses, at positions 1, 3 and 6:
Precision(k): 1/1, 0, 2/3, 0, 0, 3/6
Average Precision #Bob = (1 + 2/3 + 3/6) / 3 = 0.722
•  Alice visited 2 of her 6 recommended businesses, at positions 2 and 5:
Precision(k): 0, 1/2, 0, 0, 2/5, 0
Average Precision #Alice = (1/2 + 2/5) / 2 = 0.45
•  MAP@6 = (0.722 + 0.45) / 2 = 0.586
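The worked example above can be computed with a short Scala sketch (plain collections, illustrative names):

```scala
// Average precision at n: `visited` is the ground-truth set of businesses
// the test customer actually went to, `recommended` the ranked predictions.
def averagePrecision(recommended: Seq[String], visited: Set[String], n: Int): Double = {
  val hitPrecisions = recommended.take(n).zipWithIndex
    .filter { case (business, _) => visited.contains(business) }
    .zipWithIndex
    .map { case ((_, pos), hitIdx) => (hitIdx + 1).toDouble / (pos + 1) } // P(k) at each hit
  if (visited.isEmpty) 0.0 else hitPrecisions.sum / math.min(visited.size, n)
}

// MAP@n: average of the per-customer average precisions.
def mapAtN(customers: Seq[(Seq[String], Set[String])], n: Int): Double =
  customers.map { case (recs, visited) => averagePrecision(recs, visited, n) }.sum / customers.size
```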
Most Popular Businesses
Learn the most popular businesses during training and broadcast them as a list.
Create a recommender that maps every customer id to the same top n businesses.
The most popular businesses recommender can be used as a baseline and also as a
"padder" for filling in missing recommendations of more advanced recommenders.
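A sketch of the most-popular baseline, with plain collections standing in for RDDs and the broadcast list (names illustrative):

```scala
// Rank businesses by distinct customers, then recommend the same top-n list
// to every customer id.
def trainMostPopular(transactions: Seq[(String, String)], n: Int): String => Seq[String] = {
  val topBusinesses = transactions
    .groupBy { case (_, business) => business }
    .map { case (business, txs) => business -> txs.map(_._1).distinct.size }
    .toSeq
    .sortBy { case (_, customers) => -customers }
    .take(n)
    .map { case (business, _) => business }
  customerId => topBusinesses  // every customer gets the same list
}
```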
CUSTOMER-TO-CUSTOMER
SIMILARITY MODELS
Each customer is represented in a sparse feature space
Must define a metric space that satisfies the triangle inequality
Similarity (or distance) based on:
Common behaviour (geographical and temporal shopping journeys)
Common demographic attributes (age, residential area, gender, job
position…)
Customer Features
•  Represent each customer in terms of histograms:
–  Distribution of spending across different dimensions:
•  week days, postcode sectors, merchant categories, businesses
–  Probability distributions of its generalised attributes:
•  Online activity, gender, marital status, occupation
•  If we flatten each map and fill with 0s all of the
missing keys, we can then compute the cosine
distance between two customers
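The flatten-and-zero-fill step never needs to materialise the dense vectors: missing keys contribute nothing to the dot product. A sketch of cosine similarity over two sparse feature maps (distance is then 1 − similarity):

```scala
// Cosine similarity between two sparse feature maps; absent keys count as 0.
def cosineSimilarity(a: Map[String, Double], b: Map[String, Double]): Double = {
  val dot = a.collect { case (k, v) if b.contains(k) => v * b(k) }.sum
  def norm(m: Map[String, Double]): Double = math.sqrt(m.values.map(v => v * v).sum)
  if (norm(a) == 0.0 || norm(b) == 0.0) 0.0 else dot / (norm(a) * norm(b))
}
```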
Extracting Customer Features 1/2
There are too many businesses to fit into a Map, so we only take the top ones
and assume the tail to be negligible.
Wallet histogram: count each (customer, bin) pair using reduceByKey, followed
by a groupBy on customer to merge all of the bin counts into a map.
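The wallet-histogram step can be sketched with plain collections standing in for `reduceByKey` and `groupBy`:

```scala
// Count each (customer, bin) pair, then merge every customer's bin counts
// into a single map: customer -> (bin -> count).
def walletHistogram(pairs: Seq[(String, String)]): Map[String, Map[String, Int]] =
  pairs
    .groupBy(identity)                               // like reduceByKey: occurrences per pair
    .map { case ((customer, bin), occ) => (customer, bin, occ.size) }
    .groupBy { case (customer, _, _) => customer }   // like groupBy on customer
    .map { case (customer, rows) =>
      customer -> rows.map { case (_, bin, count) => bin -> count }.toMap
    }
```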
Extracting Customer Features 2/2
Broadcast variables should be destroyed at the end of their scope.
1. Select the distinct customer ids with the associated categorical group.
2. Perform a map-side multi-join: one map over the whole RDD with multiple
look-ups into broadcast maps.
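The map-side multi-join idea, sketched with local Maps standing in for the broadcast variables (all names and values illustrative): instead of several RDD joins, one pass over the records does lookups into small maps.

```scala
// Two small "broadcast" lookup tables (illustrative values).
val demographics = Map("c1" -> "engineer-male", "c2" -> "cook-female")
val regions      = Map("c1" -> "urban", "c2" -> "rural")

val transactions = Seq(("c1", "Waitrose"), ("c2", "Abokado"))

// One map over the data with multiple lookups, in place of two shuffling joins.
val enriched = transactions.flatMap { case (customer, business) =>
  for {
    bucket <- demographics.get(customer)  // lookup instead of join #1
    region <- regions.get(customer)       // lookup instead of join #2
  } yield (customer, business, bucket, region)
}
```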
K-Neighbours Recommender
Take the previously computed customer features and build a VPTree.
For each customer, find the approximate nearest K similar neighbours
(similarity = 1 − distance) and assign a score to each business in the
neighbours' wallets, proportional to the relative similarity score.
Since the same business may appear multiple times, sum all the scores
and take the top-ranked N.
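The scoring step can be sketched as follows (plain Scala, illustrative names): each neighbour contributes its relative similarity to every business in its wallet, and the contributions are summed per business.

```scala
// Aggregate neighbour wallets into business scores and keep the top n.
def kNeighboursScores(neighbours: Seq[(Double, Set[String])], n: Int): Seq[String] = {
  val totalSim = neighbours.map { case (sim, _) => sim }.sum
  neighbours
    .flatMap { case (sim, wallet) => wallet.toSeq.map(b => b -> sim / totalSim) }
    .groupBy { case (business, _) => business }
    .map { case (business, scores) => business -> scores.map(_._2).sum } // same business summed
    .toSeq
    .sortBy { case (_, score) => -score }
    .take(n)
    .map { case (business, _) => business }
}
```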
Vantage-point (VP) Tree
•  It's a heuristic data structure
for fast spatial search
•  Each node of the tree contains
one data point + a radius
–  The left child branch contains points
closer than the radius, the right those
farther away
•  Construction time: O(n log(n))
•  Search time*: O(log(n))
*Under certain circumstances
BUSINESS-TO-BUSINESS
SIMILARITY MODELS
Similarity metric based on the portion of
common customers
Conditional probability
Tanimoto Coefficient
Common customers matrix
Each cell represents the distinct number of common customers between two businesses:

      | B1 | B2 | B3 | B4 | Sum
  B1  |  - |  3 | 10 | 12 |  25
  B2  |  3 |  - |  8 |  0 |  11
  B3  | 10 |  8 |  - |  1 |  19
  B4  | 12 |  0 |  1 |  - |  13
  Sum | 25 | 11 | 19 | 13 |   -

Business similarities:
•  Conditional probability
•  Tanimoto coefficient
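Both similarity measures can be sketched over each business's customer set (illustrative names):

```scala
// Tanimoto coefficient: |A ∩ B| / |A ∪ B| over the two customer sets.
def tanimoto(a: Set[String], b: Set[String]): Double = {
  val common = (a intersect b).size
  common.toDouble / (a.size + b.size - common)
}

// Conditional probability: P(customer shops at B | customer shops at A).
def condProb(a: Set[String], b: Set[String]): Double =
  (a intersect b).size.toDouble / a.size
```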
Scoring example: a customer has visited businesses a (wallet weight 0.7) and b
(wallet weight 0.3). Each visited business B1 has similarity weights to its
neighbour businesses B2; neighbours already visited are zeroed out (for a:
0.2 -> 0 and 0.4 -> 0), so the remaining weights sum to 0.8 for a
({c: 0.1, d: 0.5, e: 0.2}) and 0.6 for b ({c: 0.3, d: 0.1, e: 0.2}).

"Probability" score:
P(c) = P(B2c | B1a) · P(B1a) + P(B2c | B1b) · P(B1b)

P(c) = (0.1/0.8) · 0.7 + (0.3/0.6) · 0.3 = 0.2375
P(d) = (0.5/0.8) · 0.7 + (0.1/0.6) · 0.3 = 0.4875
P(e) = (0.2/0.8) · 0.7 + (0.2/0.6) · 0.3 = 0.275
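The scoring in the example above can be sketched in code (plain Scala maps, illustrative names): each visited business's neighbour weights are renormalised after excluding visited candidates, then combined with the wallet weights.

```scala
// Business-to-business "probability" scoring: candidate score is the sum over
// visited businesses of (neighbour weight / remaining sum) * wallet weight.
def scoreCandidates(
    visited: Map[String, Double],                      // visited business -> wallet weight
    neighbourWeights: Map[String, Map[String, Double]] // visited -> candidate -> similarity
): Map[String, Double] = {
  val remainingSums = neighbourWeights.map { case (b, ns) => b -> ns.values.sum }
  neighbourWeights.toSeq
    .flatMap { case (b, ns) =>
      ns.toSeq.map { case (candidate, w) => candidate -> (w / remainingSums(b)) * visited(b) }
    }
    .groupBy { case (candidate, _) => candidate }
    .map { case (candidate, scores) => candidate -> scores.map(_._2).sum }
}
```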
NEIGHBOUR-TO-BUSINESS
Hybrid approach of K-Neighbours combined with
Business-to-Business
3 levels: customer neighbours -> neighbours'
businesses -> businesses' neighbours
We named this the Botticelli model
Customer's neighbours -> direct businesses + neighbours' businesses ->
businesses' neighbours
We know the visited business frequencies from our own wallet and we fill in
the others with our neighbours' normalised frequencies.
MATRIX FACTORIZATION
MODELS
Factorize the transaction matrix of Customer-to-
Business into 2 matrices of Customer-to-Topic
and Topic-to-Business (e.g. LSA, SVD…)
Recommendations are done by applying linear
algebra
Topic Modeling for Learning Analytics
Researchers LAK15 Tutorial
http://www.slideshare.net/vitomirkovanovic/topic-modeling-for-
learning-analytics-researchers-lak15-tutorial
ALS is available in Spark MLlib
Ratings as counts of transactions.
Model parameters are the factorized matrices. We had to re-implement the
scoring function due to scalability issues.
Recommendation scores produced by
multiplying vectors
Top N without sorting
The accumulator holds at most N elements
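The bounded-accumulator idea can be sketched with a min-heap of at most N elements: streaming the scores costs O(m log n) instead of the O(m log m) of a full sort (illustrative names, plain Scala collections):

```scala
import scala.collection.mutable

// Keep the running top-n (item, score) pairs in a bounded min-heap.
def topN[A](scored: Iterator[(A, Double)], n: Int): Seq[(A, Double)] = {
  // min-heap on score: the head is the weakest of the current top n
  val heap = mutable.PriorityQueue.empty[(A, Double)](Ordering.by[(A, Double), Double](p => -p._2))
  scored.foreach { pair =>
    heap.enqueue(pair)
    if (heap.size > n) heap.dequeue()  // evict the weakest, size stays <= n
  }
  heap.dequeueAll.reverse  // ascending by score -> descending
}
```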
OTHER APPROACHES
Covariance Matrix:
build a covariance matrix of each pair of users and then
multiply it with the user-to-business matrix
Random Forest:
one binary classifier for each business
Ensembling models:
aggregating recommendations from different models
SUMMARY AND
CONCLUSIONS
Models comparison (MAP@20):
•  Neighbour-to-Businesses: 16%
•  Business-to-Business (Tanimoto): 12%
•  ALS: 11%
•  Covariance matrix: 10%
•  Business-to-Business (conditional prob): 9%
•  K-Neighbours: 8%
•  Most popular: 3%
Remember: for every national retail chain where you have a lot of customers,
you have a lot of local niche businesses where only a small portion of the
customer base ever shops -> very hard to predict those!
Simple solutions made of counts and divisions may outperform more advanced
ones.
Limitations
•  ML and MLlib are not flexible enough and need
some extra development (bloody private fields)
•  Linear algebra libraries in MLlib are limited; it
took us a while to learn how to optimize them
•  Scala and Spark create confusion for some
method behaviour
(e.g. fold, collect, mapValues, groupBy)
•  Many machine learning libraries are based on
vectors and don’t easily allow ad-hoc definition
of data types based on the business context
Conclusions
•  Spark and Scala were excellent tools for rapid
prototyping during the week, especially for
bespoke algorithms.
•  We used the same production stack together
with notebooks for ad-hoc explorations or quick
and dirty tests.
•  At the end of the hackathon the best model was
almost a production-ready MVP
•  Automated single-button execution
•  Built a real-world recommender
•  Common evaluation APIs
•  Data validation manually done as a preparation step
•  Only MAP considered
•  Notebook analysis immediately followed by knowledge conversion into code
requirements
•  Our MVP was simplistic and did not consider a few edge cases
Off-site
•  Success of the hackathon was not solely down
to technology.
•  Innovation requires an environment where:
–  great people can connect
–  goals are clear and ambitious
–  the team works together free of distractions
–  the pressure of delivering comes from the group
–  you can fail safely, go to sleep, wake up the next day (go surfing)
and try again!
https://blog.cloudera.com/blog/2016/05/the-barclays-data-science-hackathon-using-apache-spark-and-scala-for-rapid-prototyping/
Original article on Cloudera Engineering Blog
https://github.com/gm-spacagna/lanzarote-awesomeness
GitHub code
Further Reading
A lot of references regarding Agile and Spark
http://datasciencevademecum.wordpress.com
Data Science Vademecum
The	Barclays	Data	Science	team	at	this	hackathon	was:		
Panos	Malliakas,	Victor	Paraschiv,	Harry	Powell,	Charis	
Sfyrakis,	Gianmario	Spacagna	and	Raffael	Strassnig	
http://www.datasciencemanifesto.org/
The Professional Data Science Manifesto