MasterSearch_Meetup_AdvancedAnalytics

Data science @ RTL Nederland
Longhow Lam
@longhowlam
1st Master Search Advanced Analytics meetup
5 July 2017

Agenda
• RTL Data science set up
• Some data science topics @RTL
• Text mining
• Computer vision
• Association rules
• Power BI

RTL Data science set up
Source Data
• Click data
• Heartbeat data
• Account data
• Location
• Metadata
• Campaign data
• Etc.
Data science team
4 data engineers
4 data scientists
The main businesses at RTL we work for
ETL processes
Find out for yourself:
https://www.rtl.nl/werkenbij/
Use cases
• Churn modeling
• Response modeling
• Customer segmentation
• Look-alikes for advertisers
• Recommendation engines

Content similarity
Which movies on Videoland are close to each other?
Which news articles on RTL Nieuws are close to each other?
For movies, we can look at the movie summaries or video captures.
For news articles, we can look at the text of the articles or the corresponding news image.
Hence text mining and computer vision.

Text mining
2000 movie plots / summaries on Videoland
For each movie plot: count the words / terms
Put the counts in a so-called term document matrix
There are around 50,000 terms in the 2000 movie plots
Usually this matrix is very sparse
[Table: term document matrix. Rows are the movies (Film1, Film2, Film3, …, Film2000), columns are the terms (Aap, Film, Auto, …, Leven, …, Zwaar). Each row holds only a few non-zero counts, for example 1, 4 and 8 for Film1.]
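The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used at RTL: the three plots are made-up stand-ins for the Videoland summaries, and the tokenizer (lowercase, letters only) is an assumption.

```python
from collections import Counter
import re

# Hypothetical mini-corpus standing in for the ~2000 movie plots.
plots = {
    "Film1": "een aap steelt een auto",
    "Film2": "het leven is zwaar, het leven is mooi",
    "Film3": "film over een film over een auto",
}

def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())

# One Counter per plot. A Counter stores only non-zero counts,
# which is a natural sparse representation of one matrix row.
rows = {title: Counter(tokenize(text)) for title, text in plots.items()}

# The matrix columns are the sorted vocabulary over all plots.
vocabulary = sorted(set().union(*rows.values()))

# Dense view (only sensible for a toy example): one row per movie.
matrix = {title: [counts[term] for term in vocabulary]
          for title, counts in rows.items()}

# Fraction of zero cells, i.e. how sparse the matrix is.
cells = len(matrix) * len(vocabulary)
sparsity = sum(row.count(0) for row in matrix.values()) / cells
```

Even in this toy corpus most cells are zero, which is why a sparse structure is preferable once the vocabulary grows to tens of thousands of terms.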

Similarity: cosine similarity
Between movies (so between rows of the matrix) we can now calculate similarities
A measure that is often used is cosine similarity
Visually we can see this similarity in the following figure:
Suppose we only have two terms:
1. Leven
2. Spannend
[Figure: Film 1 and Film 2 drawn as vectors in a plane with axes "# Leven" and "# Spannend"; the cosine similarity is the cosine of the angle between the two vectors.]
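For the two-term case the calculation is small enough to write out. The sketch below uses made-up counts for the terms Leven and Spannend; returning 0.0 for an all-zero row is a convention chosen here, not something stated on the slide.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero row
    return dot / (norm_u * norm_v)

# Hypothetical counts of (leven, spannend) per movie plot.
film_1 = [4, 1]
film_2 = [1, 3]
film_3 = [8, 2]  # same direction as film_1, just a longer plot

sim_12 = cosine_similarity(film_1, film_2)
sim_13 = cosine_similarity(film_1, film_3)
```

Because only the angle matters, `film_3` is as similar to `film_1` as a vector can be, even though its counts are twice as large. This makes the measure insensitive to plot length.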

Videoland: to get a feeling for movie similarities we created a small Shiny app

RTL Nieuws: Power BI dashboard to see article similarities

Two approaches used @RTL
1. Computer Vision API from vendors (Microsoft, Clarifai, Google, …)
+ Ready and easy to use (just send your image to them)
+ Not too expensive ($0.84 / 1000 images)
- No control over what is returned
2. Tweak things ourselves with Keras/TensorFlow
- Takes more effort to set up
- Needs more knowledge
+ More control over what you are doing

RTL Nieuws image: API examples
Feature Name          Value
Description           { "type": 0, "captions": [ { "text": "a group of people sitting on a table", "confidence": 0.4894670976127814 } ] }
Tags                  [ { "name": "person", "confidence": 0.996391236782074 }, { "name": "indoor", "confidence": 0.9104063510894775 }, { "name": "people", "confidence": 0.7057779431343079 } ]
Image Format          Jpeg
Image Dimensions      4096 x 3078
Clip Art Type         0 Non-clipart
Line Drawing Type     0 Non-LineDrawing
Black & White Image   False
Is Adult Content      False
Adult Score           0.042066238820552826
Is Racy Content       False
Racy Score            0.061784882098436356
Categories            [ { "name": "people_many", "score": 0.9296875 } ]
Faces                 [ { "age": 52, "gender": "Male", "faceRectangle": { "width": 298, "height": 298, "left": 433, "top": 1370 } }, { "age": 78, "gender": "Male", "faceRectangle": { "width": 269, "height": 269, "left": 3212, "top": 1410 } }, { "age": 64, "gender": "Male", "faceRectangle": { "width": 241, "height": 241, "left": 2108, "top": 1534 } } ]

Feature Name          Value
Description           { "type": 0, "captions": [ { "text": "Linda de Mol talking on a cell phone", "confidence": 0.46178352459016536 } ] }
Tags                  [ { "name": "person", "confidence": 0.9999904632568359 }, { "name": "outdoor", "confidence": 0.9974232912063599 }, { "name": "woman", "confidence": 0.9967917799949646 }, { "name": "lady", "confidence": 0.7658315896987915 } ]
Image Format          Jpeg
Image Dimensions      1024 x 421
Clip Art Type         0 Non-clipart
Line Drawing Type     0 Non-LineDrawing
Black & White Image   False
Is Adult Content      False
Adult Score           0.009753250516951084
Is Racy Content       False
Racy Score            0.014254707843065262
Categories            [ { "name": "people_portrait", "score": 0.96875 } ]
Faces                 [ { "age": 28, "gender": "Female", "faceRectangle": { "width": 282, "height": 282, "left": 286, "top": 35 } } ]
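Since there is no control over what a vendor API returns, one practical step is to keep only the tags the service is fairly sure about. The sketch below parses a Tags value copied from the examples above; the 0.9 threshold and the function name `confident_tags` are assumptions for illustration.

```python
import json

# Tags value as returned for one of the example images.
tags_json = """
[ { "name": "person", "confidence": 0.9999904632568359 },
  { "name": "outdoor", "confidence": 0.9974232912063599 },
  { "name": "woman", "confidence": 0.9967917799949646 },
  { "name": "lady", "confidence": 0.7658315896987915 } ]
"""

def confident_tags(raw, threshold=0.9):
    """Return the tag names whose confidence is at least `threshold`."""
    return [tag["name"] for tag in json.loads(raw)
            if tag["confidence"] >= threshold]

# The threshold of 0.9 is an assumption; tune it per use case.
kept = confident_tags(tags_json)
```

With this threshold the tag "lady" is dropped and the other three are kept.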

Tweak things ourselves with Keras
• Keras is a high-level neural networks API running on top of either TensorFlow or Theano, and now also CNTK
• Developed for fast experimentation
• Easier to use than TensorFlow, but you still have lots of options
• There is now also an R interface (of course created by RStudio…)

Keras: simple set-up “architecture”
TensorFlow installed on a (Linux) machine
Ideally with lots of GPUs
Python: pip install keras
You’re good to go in Python (Jupyter notebooks)
R: install_github("rstudio/keras")
You’re good to go in R / RStudio

Example in R: Neural network with two hidden layers
[Figure: network diagram with 784 input nodes (Pixel 1 … Pixel 784), two hidden layers and 10 output nodes (Label 0 … Label 9).]
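To make the diagram concrete, here is a library-free Python sketch of one forward pass through such a network: 784 pixel inputs, two hidden layers and 10 label outputs. It is not the R/Keras code from the talk. The hidden layer sizes (64 and 32), the ReLU and softmax activations and the random, untrained weights are all assumptions; in Keras the same stack would be expressed with dense layers.

```python
import math
import random

rng = random.Random(0)

def dense(n_in, n_out):
    """Randomly initialised weights and zero biases for one dense layer."""
    weights = [[rng.gauss(0, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(layer, x):
    """Weighted sum plus bias for every unit in the layer."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# 784 pixels in, two hidden layers (sizes are assumptions), 10 labels out.
hidden_1 = dense(784, 64)
hidden_2 = dense(64, 32)
output = dense(32, 10)

pixels = [rng.random() for _ in range(784)]  # stand-in for one image
h1 = relu(forward(hidden_1, pixels))
h2 = relu(forward(hidden_2, h1))
probabilities = softmax(forward(output, h2))  # one probability per label 0..9
predicted_label = probabilities.index(max(probabilities))
```

The weights are untrained, so the predicted label is meaningless here. The point is only the shape of the computation: each layer is a weighted sum followed by a non-linearity, and the final softmax turns ten scores into label probabilities.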

Using pre-trained models
Image classifiers have been trained on big GPU machines
for weeks with millions of pictures on very large networks
Not many people do that from scratch. Instead, one can
use pre-trained networks and start from there.
VGG19 deep learning model
143 million weights!!!

Predict image class using pre-trained models

RTL Nieuws images labeled with ResNet and VGG16
Link to trellisJS app

Extract features using pre-trained models
Remove top layers for feature extraction
We have a 7 × 7 × 512 ‘feature’ tensor = 25,088 values

RTL Nieuws image similarity
1024 RTL Nieuws sample pictures. Compute for each image the 25,088 feature values.
Calculate for each image the top 10 closest images, based on cosine similarity.
Little Shiny app
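The top-10 lookup can be sketched as follows. This is an illustration under assumptions: the feature vectors are random toy data (6 images with 8 values each instead of 1024 images with 25,088 values), the function names are hypothetical, and every vector is assumed to be non-zero. Scaling each vector to unit length once turns cosine similarity into a plain dot product.

```python
import math
import random

def normalise(vec):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def top_k_closest(features, k):
    """For each image, the k other images with the highest cosine similarity."""
    unit = {name: normalise(vec) for name, vec in features.items()}
    closest = {}
    for name, vec in unit.items():
        scored = sorted(
            ((sum(a * b for a, b in zip(vec, other)), other_name)
             for other_name, other in unit.items() if other_name != name),
            reverse=True,
        )
        closest[name] = [other_name for _, other_name in scored[:k]]
    return closest

# Toy stand-in for the extracted feature vectors.
rng = random.Random(42)
features = {f"img{i}": [rng.random() for _ in range(8)] for i in range(6)}
features["img5"] = [2 * x for x in features["img0"]]  # same direction as img0

closest = top_k_closest(features, k=3)
```

Comparing every image with every other image is quadratic in the number of images, which is still manageable for about a thousand pictures.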

Examples RTL Nieuws image similarities

The Brad Pitt similarity index

Take five Brad Pitt pictures
Run them through the pre-trained VGG16 and extract feature vectors.
This is a 5 by 25,088 matrix
The Brad Pitt index
Take other images, run them through the VGG16
Calculate the distances with the five Brad Pitt pictures and average:
0.771195 0.802654 0.714752 0.792587 0.8291976 0.8096944 0.665990 0.9737212

Focusing on only the face!!
0.6273 0.5908 0.8231 0.7711 0.8839 0.8975 0.6934 0.9659
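A minimal sketch of such an index, assuming that "distance" here means cosine similarity (the reported values lie between 0 and 1, but the slides do not state the measure). The 3-value vectors are toy stand-ins for the 25,088-value VGG16 features, and `similarity_index` is a hypothetical name.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def similarity_index(reference_vectors, candidate):
    """Mean cosine similarity between `candidate` and each reference vector."""
    sims = [cosine_similarity(ref, candidate) for ref in reference_vectors]
    return sum(sims) / len(sims)

# Toy stand-ins for the five reference feature vectors.
references = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
]

index_a = similarity_index(references, [1.0, 0.0, 0.0])  # close to the references
index_b = similarity_index(references, [0.0, 0.0, 1.0])  # mostly unrelated
```

Averaging over several reference pictures makes the index less dependent on the pose or lighting of any single picture.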

Can you shake hands with your neighbor?
A little Statistical Experiment

50.1% of people don’t wash their
hands after visiting the toilet

84.6% of all statistics are just
made up on the spot !!

Association Rules Mining
Market basket analysis
• Association rules mining (ARM)
Mixture of different methods
• Ensemble
ARM is one of several so-called collaborative filtering algorithms.
Collaborative filtering is a method of making recommendations about the interests of one user (filtering) by collecting preferences or behavior from many users (collaborating).
Memory-based algorithms
• Slope One (slope1)
• K nearest neighbors (knn)
Model-based algorithms
• Matrix factorization methods

Association rule mining
The basics
• Identify frequent item sets (or rules) in the customer transaction data:
• IF item X THEN item Y
• IF items A and B THEN very likely item C
• Not all rules are interesting; use ‘support’ and ‘lift’ to judge the importance of a rule

\[ \mathrm{Support}(X, Y) = \frac{\#\ \text{trxs. with}\ \{X\} \rightarrow \{Y\}}{\text{total}\ \#\ \text{trxs.}} \]

\[ \mathrm{Lift}(X, Y) = \frac{\mathrm{Support}(X, Y)}{\mathrm{Support}(X) \cdot \mathrm{Support}(Y)} \]

Support & lift
GTST → Nieuwe Tijden: 10.8%
Star Trek → GTST: 0.018%
For example, a lift of 2.5 means:
People who have watched movie X are 2.5 times more likely to watch movie Y than viewers in general.
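Support and lift are easy to compute by hand on a small example. The sketch below uses made-up viewing histories (one set of titles per user); the titles come from the slides, but the data and the resulting numbers are illustrative only.

```python
# Hypothetical viewing histories: one set of watched titles per user.
transactions = [
    {"GTST", "Nieuwe Tijden"},
    {"GTST", "Nieuwe Tijden", "Blacklist"},
    {"GTST"},
    {"Star Trek", "Stargate"},
    {"Star Trek", "Blacklist"},
    {"Blacklist"},
    {"Stargate"},
    {"Nieuwe Tijden"},
]

def support(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def lift(x, y, transactions):
    """Support(X, Y) / (Support(X) * Support(Y)).

    Assumes both x and y occur in at least one transaction.
    """
    joint = support({x, y}, transactions)
    return joint / (support({x}, transactions) * support({y}, transactions))

support_gtst_nt = support({"GTST", "Nieuwe Tijden"}, transactions)
lift_gtst_nt = lift("GTST", "Nieuwe Tijden", transactions)
lift_trek_gtst = lift("Star Trek", "GTST", transactions)
```

In this toy data a lift above 1 for GTST and Nieuwe Tijden says the two are watched together more often than independence would predict, while Star Trek and GTST never co-occur and get a lift of 0.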

Association rules: virtual items

User  Movie
1     Blacklist
1     Startrek
1     James bond
2     Kill Bill
2     Pulp fiction
3     Stargate
3     Men in Black

An old trick with association rules mining is to add ‘virtual’ items:

User  Virtual item
1     Blacklist
1     Startrek
1     James bond
1     Male
1     [25-30) Y
2     Kill Bill
2     Pulp fiction
2     Female
2     [40-45) Y
3     Stargate
3     Men in Black
3     Male
3     [50-55) Y

Rules that might now appear are, for example:
• Male, [40-45), Startrek → James Bond
• Female, [20-25), Kill Bill → Pulp Fiction
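Adding virtual items amounts to appending user attributes to each user's basket before mining. The sketch below reuses the users and movies from the table above and assumes that the last two virtual items (Male, [50-55)) belong to user 3; the function name is hypothetical.

```python
from collections import defaultdict

# (user, movie) pairs as in the table above.
views = [
    (1, "Blacklist"), (1, "Startrek"), (1, "James bond"),
    (2, "Kill Bill"), (2, "Pulp fiction"),
    (3, "Stargate"), (3, "Men in Black"),
]

# Gender and age bracket per user; assigning the last pair to user 3
# is an assumption.
profiles = {
    1: ("Male", "[25-30)"),
    2: ("Female", "[40-45)"),
    3: ("Male", "[50-55)"),
}

def baskets_with_virtual_items(views, profiles):
    """Group movies per user and add the user's attributes as extra items."""
    baskets = defaultdict(set)
    for user, movie in views:
        baskets[user].add(movie)
    for user, attributes in profiles.items():
        baskets[user].update(attributes)
    return dict(baskets)

baskets = baskets_with_virtual_items(views, profiles)
```

The resulting baskets can be fed to any association rule miner unchanged, because a virtual item is treated exactly like a movie.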

Association rules with R and Gephi

Survival curve
At which moment in an episode do people stop watching?
Can we compare different episodes and series?
Survival curves!
For a specific episode from a specific series:
• Take all Videoland streams: starts / stops from
• Determine the completion rate, and rank all streams on completion rate
• Calculate the empirical distribution F
• Survival: S = 1 − F
• Do this for all episodes and series
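The steps above can be sketched for a single episode as follows. The streams are made-up (start, stop) positions in seconds, and defining the completion rate as the stop position divided by the episode length is an assumption; the slides do not give the exact definition.

```python
from bisect import bisect_right

def survival_curve(completion_rates, grid):
    """Return (t, S(t)) pairs with S = 1 - F, F the empirical distribution."""
    ranked = sorted(completion_rates)  # rank the streams on completion rate
    n = len(ranked)
    curve = []
    for t in grid:
        f_t = bisect_right(ranked, t) / n  # share of streams with rate <= t
        curve.append((t, 1.0 - f_t))
    return curve

# Hypothetical streams for one episode of 1000 seconds.
episode_length = 1000
streams = [(0, 100), (0, 250), (0, 500), (0, 1000),
           (0, 1000), (0, 900), (0, 50), (0, 750)]

# Assumed definition: stop position as a fraction of the episode length.
completion = [min(stop, episode_length) / episode_length
              for _start, stop in streams]

curve = survival_curve(completion, grid=[0.0, 0.25, 0.5, 0.75, 1.0])
```

Each point S(t) is the share of streams that were still running after a fraction t of the episode, so curves for different episodes and series can be compared on the same 0 to 1 axis.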
