CXENSE 2017 | Large-Scale User Similarity Modeling

CXENSE 2017 | DEXA www.cxense.com
Large-Scale User Similarity Modeling
Arne Sund, Head of Data Science at Cxense

A global SaaS company
• Norwegian tech company with a global presence
• ~60 engineers incl. 5 data scientists
• Media and Publishing vertical
• Delivering solutions for:
  • Insight into online user patterns
  • Superior content recommendations
  • Segmentation of users using ML / AI
  • Campaign creation for targeting on own sites

Challenges facing publishers online
• Declining ad revenue
• The Duopoly
• How to attract digital-only subscribers
• How to keep existing subscribers
How to engage users online and keep them on our site longer?

Powering the journey from casual visitor to subscriber
• Awareness
• Consideration
• Subscription
• Loyalty
• Churn prevention
[Diagram labels: Anonymous User, Known User, Revenue]

User Similarity Modeling
• To tailor content recommendations and offers/ads
• Small set of known users as input
• Find similar anonymous users

[Diagram: within all unique users, the original segment (truth sample) and the lookalike segment (predicted)]

Defining user similarity
Users Are What They Read
• Represent users as vectors of consumed content
• Hit count vector of words and phrases
• High-dimensional vector space
• Computed using multiple algorithms

Pageview events
User 4sk9yk1sb7v8yxas visited adressa.no at 18:02
Augmented with additional details
• Device: Type, Brand, Browser, OS
• Location: Country, Region, City
• Referrer: URL, Type
• Engagement: Active time, scroll depth
• ...

Content profiles
NLP
(Article excerpt, translated from Norwegian)
Evjen sold to AZ Alkmaar.
The wing sensation Håkon Evjen leaves Bodø/Glimt after the season
and turns professional with Dutch club AZ Alkmaar. Glimt confirms
this on its website. Evjen has signed a 4.5-year contract with his
new club, where he starts on 1 January. "Right now it feels good,
and I am happy. I think this can be very exciting ..."
Group           | Item
classification  | sports
pageclass       | article
person          | håkon evjen
person          | fredrik midtsjø
entity          | eliteserien
keyword         | bodø/glimt
location        | nederland
...             | ...

Represent users as vectors
• Load all pageview events for a month
• Map URLs in pageview events to the content profile for that URL
• Create a vector where each unique word is a column
• Store hit count of each word in the right column for each user
User             | sports | håkon evjen | nederland | eliteserien | ...
4sk9yk1sb7v8yxas | 40     | 2           | 1         | 28          | ...
...              | ...    | ...         | ...       | ...         | ...
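The steps above can be sketched as follows. This is a minimal illustration, not the production pipeline: the toy events, the example.com URLs, the second user ID, and the helper name `build_user_matrix` are all assumptions made for the example.

```python
from collections import Counter

import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data: (user_id, url) pageview events and URL -> content profile terms.
pageviews = [
    ("4sk9yk1sb7v8yxas", "https://example.com/a"),
    ("4sk9yk1sb7v8yxas", "https://example.com/b"),
    ("user2", "https://example.com/b"),
]
content_profiles = {
    "https://example.com/a": ["sports", "håkon evjen", "eliteserien"],
    "https://example.com/b": ["sports", "nederland"],
}


def build_user_matrix(pageviews, content_profiles):
    """Return (matrix, user_ids, vocabulary): one row per user, one column per term."""
    user_index, term_index = {}, {}
    counts = Counter()
    for user_id, url in pageviews:
        row = user_index.setdefault(user_id, len(user_index))
        # Map the URL to its content profile; URLs without a profile contribute nothing.
        for term in content_profiles.get(url, []):
            col = term_index.setdefault(term, len(term_index))
            counts[(row, col)] += 1
    rows = np.fromiter((r for r, _ in counts), dtype=np.int64, count=len(counts))
    cols = np.fromiter((c for _, c in counts), dtype=np.int64, count=len(counts))
    data = np.fromiter(counts.values(), dtype=np.int32, count=len(counts))
    matrix = csr_matrix((data, (rows, cols)), shape=(len(user_index), len(term_index)))
    return matrix, list(user_index), list(term_index)


matrix, user_ids, vocabulary = build_user_matrix(pageviews, content_profiles)
```

A sparse format is used because each user touches only a tiny share of all possible terms.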

Computing user similarity
How to compare each user to a group of users
• Compute centroid (average vector) for the group
• Compute similarity for each user to the centroid
• Process batches of users
Scale quickly becomes a concern
• Millions of unique users are common
• A lot of possible words and phrases
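A minimal sketch of the centroid comparison, processed in batches so that only one slice of users is handled at a time. The function name `similarity_to_group`, the batch size, and the toy matrix are assumptions for illustration; scikit-learn's `cosine_similarity` is used as the similarity measure.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity


def similarity_to_group(user_matrix, group_rows, batch_size=2):
    """Cosine similarity of every user row to the centroid of the rows in group_rows."""
    # Centroid = average vector of the group, reshaped to a single dense row.
    centroid = np.asarray(user_matrix[group_rows].mean(axis=0)).reshape(1, -1)
    scores = np.empty(user_matrix.shape[0])
    for start in range(0, user_matrix.shape[0], batch_size):
        batch = user_matrix[start:start + batch_size]
        scores[start:start + batch_size] = cosine_similarity(batch, centroid).ravel()
    return scores


# Toy hit counts (made up): rows 0 and 1 form the known group.
users = csr_matrix(np.array([[4, 0, 1], [2, 0, 0], [0, 3, 0], [1, 0, 0]]))
scores = similarity_to_group(users, group_rows=[0, 1])
```

Comparing against one centroid keeps the cost linear in the number of users, instead of comparing every user with every group member.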

75 000 000 users x 8 700 000 words and phrases

Pearson’s Chi-Squared Test
• A measure of independence
• Find important words and phrases
• Reduce number of columns
• Easy to use in a Scikit-Learn Pipeline
sklearn.feature_selection.SelectKBest(score_func = chi2, …)
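A small sketch of how `SelectKBest` with `chi2` reduces the number of columns. The toy matrix, the choice of `k=2`, and the labelling scheme (1 for users in the known segment, 0 for everyone else) are assumptions; the slides do not state how the labels were defined.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_selection import SelectKBest, chi2

# Toy hit-count matrix: 6 users x 4 terms (values are made up).
X = csr_matrix(np.array([
    [5, 0, 1, 2],
    [4, 0, 0, 2],
    [6, 1, 1, 2],
    [0, 3, 1, 2],
    [0, 4, 0, 2],
    [1, 5, 1, 2],
]))
# Assumed labels: 1 = user is in the known segment, 0 = any other user.
y = np.array([1, 1, 1, 0, 0, 0])

# Keep the k columns whose counts depend most strongly on the label.
selector = SelectKBest(score_func=chi2, k=2)
X_reduced = selector.fit_transform(X, y)
kept_columns = selector.get_support(indices=True)
```

Columns 2 and 3 are spread evenly across both classes, so they score zero and are dropped. `chi2` requires non-negative features, which hit counts satisfy.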

Cosine similarity
• Values between -1 and 1 (between 0 and 1 for non-negative hit counts)
• Closer to 1 means more similar
• Independent of vector length
• Reducing effect of amount of consumed content
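To illustrate the independence of vector length, here is a hand-written cosine similarity applied to made-up readers. The function and the reader vectors are hypothetical examples.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    # Define the similarity with an all-zero vector as 0 to avoid dividing by zero.
    return float(u @ v / norm) if norm else 0.0


light_reader = [4, 0, 1]     # few pageviews
heavy_reader = [40, 0, 10]   # ten times the pageviews, same mix of topics
other_reader = [0, 5, 0]     # no topics in common with the two above

same_mix = cosine(light_reader, heavy_reader)
no_overlap = cosine(light_reader, other_reader)
```

The light and the heavy reader score as (numerically) identical because only the direction of the vector matters, not how much content was consumed.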

Ranking based on similarity score
• Ranking of anonymous users
• Pick users with highest similarity as the lookalikes
• Customers choose a fraction [%] of the total unique users
• Store results as new segment
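The selection step could look roughly like this. It is a sketch under assumptions: `pick_lookalikes` is a hypothetical helper, the scores are assumed to cover anonymous users only, and the fraction is applied to the number of scored users.

```python
import numpy as np


def pick_lookalikes(user_ids, scores, fraction):
    """Return the user IDs with the highest scores, sized as a fraction of all scored users."""
    n_pick = int(len(user_ids) * fraction)
    if n_pick == 0:
        return []
    scores = np.asarray(scores)
    # argpartition finds the top n_pick without a full sort; only the winners are then sorted.
    top = np.argpartition(-scores, n_pick - 1)[:n_pick]
    top = top[np.argsort(-scores[top])]
    return [user_ids[i] for i in top]


lookalikes = pick_lookalikes(["a", "b", "c", "d"], [0.1, 0.9, 0.5, 0.7], fraction=0.5)
```

Partial selection avoids sorting all users when only a small fraction is kept.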

Dataset Size & Scaling
• Billions of pageview events
• Hundreds of millions of unique users
• Millions of unique URLs
And it keeps growing!

Optimize, run, optimize again
• Parallelize on every layer: threads, processes, jobs
• Keep the memory usage under control
• Use gRPC for data transfer whenever possible
• Stream big API responses directly to disk
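As one way to picture the last point, a response body can be copied to disk chunk by chunk so that it never sits in memory as a whole. This sketch uses only the standard library; `stream_to_disk` is a hypothetical helper and the in-memory `BytesIO` object merely stands in for a real HTTP response stream.

```python
import io
import shutil
import tempfile
from pathlib import Path


def stream_to_disk(response_stream, path, chunk_size=1024 * 1024):
    """Copy a file-like response body to disk in fixed-size chunks."""
    with open(path, "wb") as out:
        shutil.copyfileobj(response_stream, out, length=chunk_size)
    return Path(path).stat().st_size


# Stand-in for a large response body; a real client would pass its response stream here.
fake_response = io.BytesIO(b"x" * 5_000_000)
target = Path(tempfile.mkdtemp()) / "response.bin"
bytes_written = stream_to_disk(fake_response, target)
```

Memory use is bounded by `chunk_size` rather than by the size of the response.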

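Optimizing Scipy Methods
    import numpy as np
    import scipy as sp
    import scipy.sparse

    for b in range(n_batches):
        ...
        # Append this batch's column indices.
        indices = np.hstack((indices, new_matrix.indices.astype(np.int32)))
        # Shift the batch's row pointers by the values stored so far; drop the leading 0.
        indptr = np.hstack((indptr, (new_matrix.indptr.astype(np.int64) + len(values))[1:]))
        # Append the hit counts last, since indptr relies on the previous len(values).
        values = np.hstack((values, new_matrix.data.astype(np.int16)))

    matrix = sp.sparse.csr_matrix((values, indices, indptr),
                                  shape=(len(url_sets), vocab_length))
Creating a sparse matrix is easy using Scipy, until you discover that its approach for stacking matrices is inefficient.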
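The snippet above elides how each batch is produced, so here is a self-contained sketch of the same idea: building one CSR matrix directly from the raw `data`, `indices`, and `indptr` arrays of the batches. It is a variation rather than the original code: the pieces are collected in lists and concatenated once at the end instead of calling `np.hstack` in every iteration. The name `stack_csr_batches` is hypothetical, and at least one batch is assumed.

```python
import numpy as np
import scipy.sparse as sps


def stack_csr_batches(batches, n_cols):
    """Vertically stack CSR matrices by concatenating their raw arrays once."""
    data_parts, indices_parts = [], []
    indptr_parts = [np.zeros(1, dtype=np.int64)]
    offset = 0   # number of stored values in the batches seen so far
    n_rows = 0
    for m in batches:
        m = m.tocsr()
        nnz = int(m.indptr[-1])
        data_parts.append(m.data[:nnz].astype(np.int16))
        indices_parts.append(m.indices[:nnz].astype(np.int32))
        # Shift this batch's row pointers by the values already stored; drop its leading 0.
        indptr_parts.append(m.indptr[1:].astype(np.int64) + offset)
        offset += nnz
        n_rows += m.shape[0]
    return sps.csr_matrix(
        (np.concatenate(data_parts), np.concatenate(indices_parts), np.concatenate(indptr_parts)),
        shape=(n_rows, n_cols),
    )


a = sps.csr_matrix(np.array([[1, 0, 2], [0, 0, 3]]))
b = sps.csr_matrix(np.array([[0, 4, 0]]))
stacked = stack_csr_batches([a, b], n_cols=3)
```

The result matches `scipy.sparse.vstack([a, b])` for this toy input; whether the manual route is faster depends on the SciPy version and the batch sizes, so it is worth measuring.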

Questions?
Feel free to reach out via LinkedIn or the Meetup forum!
… and by the way: American company Piano bids 351 million for the media company Cxense
