CXENSE 2017 | Large-Scale User Similarity Modeling
1. CXENSE 2017 | DEXA www.cxense.com
Large-Scale User Similarity Modeling
Arne Sund, Head of Data Science at Cxense
2. CXENSE 2017 www.cxense.com
A global SaaS company
• Norwegian tech company with a global presence
• ~60 engineers incl. 5 data scientists
• Media and Publishing vertical
• Delivering solutions for:
  • Insight into online user patterns
  • Superior content recommendations
  • Segmentation of users using ML / AI
  • Create campaigns for targeting on own sites
3. CXENSE 2017 www.cxense.com
Challenges facing publishers online
• Declining ad revenue
• The Duopoly
• How to attract digital-only subscribers
• How to keep existing subscribers
How to engage users online and keep them on our site longer?
4. CXENSE 2018 www.cxense.com
Powering the journey from casual visitor to subscriber
• Awareness
• Consideration
• Subscription
• Loyalty
• Churn prevention
Anonymous user → Known user
Revenue
5. CXENSE 2017 www.cxense.com
User Similarity Modeling
• To tailor content recommendations and offers/ads
• Small set of known users as input
• Find similar anonymous users
7. CXENSE 2017 www.cxense.com
Defining user similarity
Users Are What They Read
• Represent users as vectors of consumed content
• Hit count vector of words and phrases
• High-dimensional vector space
• Computed using multiple algorithms
8. CXENSE 2017 www.cxense.com
Pageview events
User 4sk9yk1sb7v8yxas visited adressa.no at 18:02
Augmented with additional details
• Device: Type, Brand, Browser, OS
• Location: Country, Region, City
• Referrer: URL, Type
• Engagement: Active time, scroll depth
• ...
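A pageview event of this shape could be modeled as a small record type. This is only an illustrative sketch: the class name, field names, and types are hypothetical and are not taken from the Cxense data model. Only the example values (user id, site, time) come from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageviewEvent:
    """One pageview plus the optional augmentation fields listed on the slide."""

    user_id: str
    site: str
    time: str  # kept as a plain "HH:MM" string, as shown on the slide

    # Augmented details; every field name below is a hypothetical choice.
    device_type: Optional[str] = None
    device_brand: Optional[str] = None
    browser: Optional[str] = None
    os: Optional[str] = None
    country: Optional[str] = None
    region: Optional[str] = None
    city: Optional[str] = None
    referrer_url: Optional[str] = None
    referrer_type: Optional[str] = None
    active_time_seconds: Optional[float] = None
    scroll_depth: Optional[float] = None


# The event from the slide; augmentation fields are left unset.
event = PageviewEvent(user_id="4sk9yk1sb7v8yxas", site="adressa.no", time="18:02")
```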
9. CXENSE 2017 www.cxense.com
Content profiles
NLP
Evjen sold to AZ Alkmaar.
Winger sensation Håkon Evjen is leaving Bodø/Glimt after the season
to turn professional with Dutch club AZ Alkmaar. Glimt confirms
this on its website. Evjen has signed a 4.5-year contract with his
new club, where he starts on 1 January. "Right now it feels good,
and I am happy. I think this could be very exciting ..."
Group           Item
classification  sports
pageclass       article
person          håkon evjen
person          fredrik midtsjø
entity          eliteserien
keyword         bodø/glimt
location        nederland
...             ...
10. CXENSE 2017 www.cxense.com
Represent users as vectors
• Load all pageview events for a month
• Map URLs in pageview events to the content profile for that URL
• Create a vector where each unique word is a column
• Store hit count of each word in the right column for each user
                  sports  håkon evjen  nederland  eliteserien  ...
4sk9yk1sb7v8yxas  40      2            1          28           ...
...               ...     ...          ...        ...          ...
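The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Cxense implementation: the function name, the toy URLs, the second user id, and the hit counts are all made up, and a real run would read a month of events from storage rather than from an in-memory list.

```python
from collections import Counter

import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy input: (user_id, url) pageview events and one content profile per URL.
pageviews = [
    ("4sk9yk1sb7v8yxas", "https://example.com/a"),
    ("4sk9yk1sb7v8yxas", "https://example.com/b"),
    ("user-2", "https://example.com/b"),
]
content_profiles = {
    "https://example.com/a": ["sports", "håkon evjen", "nederland"],
    "https://example.com/b": ["sports", "eliteserien"],
}


def build_user_term_matrix(pageviews, content_profiles):
    """Return (matrix, user_ids, vocabulary): one row per user, one column per term."""
    counts = {}  # user_id -> Counter of term hits
    for user_id, url in pageviews:
        terms = content_profiles.get(url, [])  # URLs without a profile add nothing
        counts.setdefault(user_id, Counter()).update(terms)

    user_ids = sorted(counts)
    vocabulary = sorted({term for c in counts.values() for term in c})
    column = {term: j for j, term in enumerate(vocabulary)}

    rows, cols, data = [], [], []
    for i, user_id in enumerate(user_ids):
        for term, hits in counts[user_id].items():
            rows.append(i)
            cols.append(column[term])
            data.append(hits)

    # Sparse storage: most users never hit most terms.
    matrix = csr_matrix(
        (data, (rows, cols)),
        shape=(len(user_ids), len(vocabulary)),
        dtype=np.int32,
    )
    return matrix, user_ids, vocabulary


matrix, user_ids, vocabulary = build_user_term_matrix(pageviews, content_profiles)
```

A sparse matrix is used because each user touches only a small part of the vocabulary, so storing the zeros explicitly would waste memory.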
11. CXENSE 2017 www.cxense.com
Computing user similarity
How to compare each user to a group of users
• Compute centroid (average vector) for the group
• Compute similarity for each user to the centroid
• Process batches of users
Scale quickly becomes a concern
• Millions of unique users are common
• A lot of possible words and phrases
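A rough sketch of the centroid comparison, processed in batches so that only one slice of users is held in working memory at a time. The function name, the batch size, and the toy matrix are assumptions for illustration; the slide does not show the production code.

```python
import numpy as np
from scipy.sparse import csr_matrix


def similarity_to_centroid(user_matrix, seed_rows, batch_size=1000):
    """Cosine similarity of every user row to the centroid of the seed rows."""
    # Centroid = average vector of the known (seed) users, as a dense 1-D array.
    centroid = np.asarray(user_matrix[seed_rows].mean(axis=0)).ravel()
    centroid_norm = np.linalg.norm(centroid)

    n_users = user_matrix.shape[0]
    scores = np.zeros(n_users)
    for start in range(0, n_users, batch_size):
        # Cast to float so squared hit counts cannot overflow small integer types.
        batch = user_matrix[start:start + batch_size].astype(np.float64)
        dots = batch @ centroid  # one dot product per user in the batch
        norms = np.sqrt(np.asarray(batch.multiply(batch).sum(axis=1)).ravel())
        denom = norms * centroid_norm
        # Users with an all-zero vector get similarity 0 instead of a division error.
        scores[start:start + batch_size] = np.divide(
            dots, denom, out=np.zeros_like(dots, dtype=float), where=denom > 0
        )
    return scores


# Toy data: row 0 is the only known user; row 3 has no pageviews at all.
users = csr_matrix(np.array([
    [40, 2, 1, 28],
    [20, 1, 0, 14],
    [0, 0, 5, 0],
    [0, 0, 0, 0],
]))
scores = similarity_to_centroid(users, seed_rows=[0], batch_size=2)
```

Here the second user reads the same topics as the seed user at roughly half the volume and still scores close to 1, while the third user shares almost nothing with the seed and scores near 0.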
13. CXENSE 2017 www.cxense.com
Pearson’s Chi-Squared Test
• A test of independence between a feature (word or phrase) and the class label
• Find important words and phrases
• Reduce number of columns
• Easy to use in a Scikit-Learn Pipeline
sklearn.feature_selection.SelectKBest(score_func=chi2, …)
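A minimal sketch of chi-squared feature selection inside a Scikit-Learn Pipeline. The slide does not say which labels or which `k` were used, so the labels here (1 = known user, 0 = other user), `k=3`, the toy counts, and the added `Normalizer` step are all assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# Toy hit counts; hypothetical columns: sports, eliteserien, weather, nederland.
X = csr_matrix(np.array([
    [40, 28, 0, 1],
    [35, 20, 1, 0],
    [1, 0, 30, 1],
    [0, 1, 25, 0],
]))
# Assumed labels: 1 = known (seed) user, 0 = any other user.
y = np.array([1, 1, 0, 0])

pipeline = Pipeline([
    # Keep the k columns whose counts depend most strongly on the label.
    ("select", SelectKBest(score_func=chi2, k=3)),
    # Scale rows to unit length so a later dot product equals cosine similarity.
    ("normalize", Normalizer(norm="l2")),
])

X_reduced = pipeline.fit_transform(X, y)
kept_columns = pipeline.named_steps["select"].get_support(indices=True)
```

`chi2` requires non-negative features, which hit counts satisfy, and it works directly on sparse input. In this toy data the last column is spread evenly over both groups, so it is the one that gets dropped.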
14. CXENSE 2017 www.cxense.com
Cosine similarity
• Values between -1 and 1 (between 0 and 1 for non-negative hit counts)
• Closer to 1 means more similar
• Independent of vector length (magnitude)
• Reduces the effect of how much content a user has consumed
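A small worked example of the length-independence property. The function and the three toy readers are illustrative only; the numbers are made up.

```python
import numpy as np


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Same reading mix, ten times the volume: the direction is identical.
light_reader = [4, 1, 0]
heavy_reader = [40, 10, 0]
# No overlap in consumed content: the vectors are orthogonal.
other_reader = [0, 0, 7]

same_interest = cosine_similarity(light_reader, heavy_reader)
no_overlap = cosine_similarity(light_reader, other_reader)
```

Because only the direction of the vectors matters, a user with a handful of pageviews can still match a heavy reader with the same interests.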
15. CXENSE www.cxense.com
Ranking based on similarity score
• Ranking of anonymous users
• Pick users with highest similarity as the lookalikes
• Customers choose a fraction [%] of the total unique users
• Store results as new segment
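Picking the lookalikes for a chosen fraction could look roughly like this. The function name, the rounding rule, and the toy ids and scores are assumptions; how the resulting segment is stored is not shown on the slide and is left out.

```python
import numpy as np


def pick_lookalikes(user_ids, scores, fraction):
    """Return the ids of the top `fraction` of users by similarity, best first."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    scores = np.asarray(scores, dtype=float)
    if scores.size == 0:
        return []
    # Assumed rounding rule: nearest whole user, but always at least one.
    k = max(1, int(round(fraction * scores.size)))
    # argpartition finds the k best scores without sorting all users.
    top = np.argpartition(-scores, k - 1)[:k]
    # Sort only the selected k, highest similarity first.
    top = top[np.argsort(-scores[top])]
    return [user_ids[i] for i in top]


segment = pick_lookalikes(["a", "b", "c", "d"], [0.1, 0.9, 0.5, 0.7], fraction=0.5)
```

Using a partial sort keeps the selection step cheap when the candidate set contains millions of users and only a few percent are kept.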
16. CXENSE 2017 www.cxense.com
Dataset Size & Scaling
• Billions of pageview events
• Hundreds of millions of unique users
• Millions of unique URLs
And it keeps growing!
17. CXENSE 2017 www.cxense.com
Optimize, run, optimize again
• Parallelize on every layer: threads, processes, jobs
• Keep the memory usage under control
• Use gRPC for data transfer whenever possible
• Stream big API responses directly to disk
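As an illustration of the last point, a large response can be copied to disk chunk by chunk instead of being read into memory first. This sketch uses only the Python standard library; the function name, the chunk size, and the URL in the usage comment are hypothetical, and the slide does not say which HTTP client was used.

```python
import os
import shutil
import urllib.request


def stream_to_disk(url, destination, chunk_size=1024 * 1024):
    """Copy a response body to `destination` in chunks; return the bytes written."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        # copyfileobj reads at most `chunk_size` bytes at a time, so memory use
        # stays flat no matter how large the response is.
        shutil.copyfileobj(response, out, chunk_size)
    return os.path.getsize(destination)


# Hypothetical usage (placeholder URL, not a real endpoint):
# stream_to_disk("https://api.example.com/export", "export.json")
```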
18. CXENSE 2017 www.cxense.com
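Optimizing SciPy Methods
Creating a sparse matrix is easy using SciPy.
Until you discover that its approach for stacking matrices is inefficient.
import numpy as np
import scipy as sp
import scipy.sparse

# Append each batch's CSR arrays to the running arrays, then build one matrix at the end.
for b in range(n_batches):
    ...
    indices = np.hstack((indices, new_matrix.indices.astype(np.int32)))
    # Shift the batch's row pointers by the number of values stored so far and
    # drop its leading 0, which would duplicate the previous end pointer.
    indptr = np.hstack((indptr, (new_matrix.indptr.astype(np.int64) + len(values))[1:]))
    values = np.hstack((values, new_matrix.data.astype(np.int16)))
matrix = sp.sparse.csr_matrix((values, indices, indptr),
                              shape=(len(url_sets), vocab_length))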
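The same idea can be written as a self-contained helper. This is a variation for illustration, not the code from the talk: the function name is hypothetical, and instead of growing the arrays with `np.hstack` on every batch it collects the pieces in lists and concatenates once at the end, which avoids re-copying the accumulated data each iteration. It assumes at least one input matrix.

```python
import numpy as np
import scipy.sparse as sparse


def stack_csr_rows(matrices, n_cols):
    """Vertically stack CSR matrices by joining their raw data/indices/indptr arrays."""
    data_parts, index_parts = [], []
    indptr_parts = [np.zeros(1, dtype=np.int64)]  # the combined indptr starts at 0
    offset = 0  # number of stored values seen so far
    n_rows = 0
    for m in matrices:
        m = m.tocsr()
        data_parts.append(m.data)
        index_parts.append(m.indices)
        # Shift this block's row pointers by the values already stored and drop
        # its leading 0, which duplicates the previous block's end pointer.
        indptr_parts.append(m.indptr[1:].astype(np.int64) + offset)
        offset += len(m.data)
        n_rows += m.shape[0]

    # One concatenation per array instead of one copy per batch.
    data = np.concatenate(data_parts)
    indices = np.concatenate(index_parts)
    indptr = np.concatenate(indptr_parts)
    return sparse.csr_matrix((data, indices, indptr), shape=(n_rows, n_cols))


# Toy check against SciPy's own vstack.
a = sparse.random(3, 5, density=0.5, format="csr", random_state=0)
b = sparse.random(2, 5, density=0.5, format="csr", random_state=1)
stacked = stack_csr_rows([a, b], n_cols=5)
expected = sparse.vstack([a, b]).tocsr()
```

The result should match `scipy.sparse.vstack` on the same inputs, so the helper can be checked against it on small data before it is used at scale.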
19. www.cxense.com
Questions?
Feel free to reach out via LinkedIn or the Meetup forum!
… and by the way: American company Piano bids 351 million for the media company Cxense