Exploratory Analysis of User Data

Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Behrooz Omidvar-Tehrani

Research Scientist at Grenoble AI Institute

http://www.omidvar.info
Intensive course in RAIS summer school, 17-19 May 2021

• Behrooz Omidvar-Tehrani, PhD in Computer Science and Applied Mathematics

• Research focus on interactive data analysis, at the crossroads of machine learning, data science, and data mining.
About the instructor
• 2016-2017: Postdoctoral Researcher at The Ohio State University
• 2017-2018: Postdoctoral Researcher at The Grenoble Alpes University
• 2019-2020: Research Scientist at Naver Labs Europe
• 2021-Present: Research Scientist at Grenoble AI Institute

Why user data?
• Because user data is ubiquitous.

• Users are very active on the Web generating user data.

• Here is what has happened in the last 5 minutes on the Web (per http://pennystocks.la/internet-in-real-time):
• 3M new tweets posted on Twitter
• 24M videos watched on YouTube
• 274K photos uploaded to Instagram
• 8M photos liked on Instagram
• 22M searches performed on Google
• 16M posts added on Facebook
• 12M messages sent on WhatsApp
• 51K video hours watched on Netflix
• 1M users participated in a Zoom call

Hunger for user data
• The number of requests to obtain user data has increased drastically.

• Google received 48,941 government data requests affecting 83,345 user accounts in the first six months of 2017. The United States issued 16,823 of these requests.

• Dataset Search indexes almost 25 million user datasets. (https://blog.google/products/search/discovering-millions-datasets-web/)

Why analyze user data?
• In general, data analysis means to “collect data” and “provide insights”.

• User data analysis means to extract value from user data → behavioral analytics

• It unveils insights into the behavior of customers.
Netflix movie recommendation

© UX Collective
Amazon product recommendation

© MagePlaza
Analytical dashboards for business insights

© Marketing Land
Automated medical analysis

© 123 RF
[Omidvar-Tehrani and Amer-Yahia, TKDE’19]

• User data is voluminous and noisy, hence hard to get insights from.

• Often an analysis pipeline is designed to tackle the challenges of volume and noise.

• We abbreviate it as the UDA pipeline.

• Why post-processing? Because mined results and recommendations need to be rendered in a human-understandable form.

• Why user data presentation? When digesting the insights, the human brain performs better on visual elements than on textual information.

• Why user data exploration? An exhaustive scan through all discovered groups is not possible for users.
User data analysis pipeline
Figure: the UDA pipeline. Raw user data → User Data Preparation (towards less noise) → User Data Mining, Learning, and Recommendation (towards less volume) → post-processing → User Data Presentation → User Data Exploration, which the user drives through interaction.
[Omidvar-Tehrani, Amer-Yahia, Simon @ HILDA’19]

User roles in UDA pipelines
• Users with different roles and needs write UDA pipelines to achieve tasks.
• Data scientist, who brings analysis expertise
• Domain expert, who brings domain knowledge
• Information consumer, who brings the task

Objectives and the timeline of the course
Objectives

• Motivate UDA and UDA pipelines and illustrate their importance in practice

• Understand the underlying structure of user data in its general form

• Walk through UDA pipelines and discuss their components, from preparation to exploration

• Work on hands-on experiences to observe the challenges of UDA implementation in practice

• Get familiar with the state of the art in UDA research

Timeline

• Session 1. Monday 17 May 2021 at 10:30 - 12:30 (Introduction, User Data Preparation and Visualization)

• Session 2. Tuesday 18 May 2021 at 10:30 - 12:30 (User Data Mining and Recommendation)

• Session 3. Wednesday 19 May 2021 at 10:30 - 12:30 (User Data Exploration with Reinforcement Learning)

Topics covered in the course
Figure: the UDA pipeline annotated with the sessions. User Data Preparation (Session 1), User Data Mining, Learning, and Recommendation (Session 2), User Data Presentation (Session 1), User Data Exploration (Session 3).

What is the general model behind all user datasets?

How to prepare user data for analysis?

How to increase the quality of user data?

How to make sense out of user data?

How to discuss user data with collaborators?

How to discover (mine) insights in user data?

How to build a recommender engine for user data?

How to recommend to a group of users?


How to build interactive user data analysis systems?

How to learn interactions with user data?

How to guide users in labor-intensive tasks?

Course material

• This course is interactive. You participate in 10 polls throughout the course.

• Hands-on experiences. Some code templates will be delivered at the end of each session to practice the learned material.

• Course slides. Available at http://www.omidvar.info/#activities (“teaching” section).

• Questions. Please use during the sessions. For all other questions, email me at behrooz@omidvar.info.

About exercises
• Hands-on #1: Research paper finder. Practicing data crawling and data collection. Requirement: Python

• Hands-on #2: D3 histogram. Practicing user data visualization. Requirement: JavaScript and HTML

• Hands-on #3: Mining user groups. Practicing user data mining and itemset mining. Requirement: Python, basic C, basic cmd

• Hands-on #4: Multi-objective mining. Practicing multi-objective optimization. Requirement: Java

• Hands-on #5: Recommendation. Practicing recommendation algorithms. Requirement: Python

• Hands-on #6: Implementing exploration semantics. Practicing data / problem modeling. Requirement: Math and Logic

• Hands-on #7: Designing a Markov Decision Process. Practicing Markov Decision Processes. Requirement: Math and Logic

• Hands-on #8: RL for Exploratory User Data Analysis. Practicing reinforcement learning. Requirement: Python

Poll: Prioritizing actions in user data analysis

• Question. You are a data scientist in a company owning terabytes of user data. They ask you to deliver some good insights about their data but they don’t have any specific questions to ask (or any hypotheses to form). They only give you one week to deliver results. How do you prioritize your actions?

• Popular answers

• (A1) I start cleaning the data, building a visualization dashboard, and presenting some insights using the dashboard.

• (A2) I prepare the data for exploration and ask the data owners to navigate in the data and evaluate some hypotheses.

• (A3) I don't start the implementation, and I'll first think on paper for a bit, in order to come up with a good pipeline plan.

• (A4) I start performing some predictions on the raw data, followed by some post-processing steps.

• (A5) I will perform some mining on the raw data, followed by some post-processing steps.

• Votes: A1 35%, A2 30%, A3 25%, A4 5%, A5 5%

User Data Preparation and Visualization
SESSION 1

• User data is a (complex) bipartite graph between the set of users 𝒰 and the set of items ℐ.

• Attributes 𝒜 describe both users and items.
User data model
Figure: users 𝒰 are described by user demographics (gender, age, occupation, location, health status); items ℐ include movies, medicine, groceries, music, books, and tweets; temporal actions link users to items.
[Omidvar-Tehrani, Amer-Yahia @ TKDE’20]
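The bipartite model above can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name `UserData`, its fields, and the helper methods are hypothetical and do not come from the cited paper. Users and items are the two node sets, attributes are plain dictionaries, and each action is an edge carrying its own attributes.

```python
from dataclasses import dataclass, field


@dataclass
class UserData:
    """Bipartite user data: users and items are nodes, actions are edges."""
    users: dict = field(default_factory=dict)    # user id -> attribute dict
    items: dict = field(default_factory=dict)    # item id -> attribute dict
    actions: list = field(default_factory=list)  # (user id, item id, action attributes)

    def add_action(self, user_id, item_id, **attributes):
        # An action links a known user to a known item.
        if user_id not in self.users or item_id not in self.items:
            raise KeyError("unknown user or item")
        self.actions.append((user_id, item_id, attributes))

    def items_of(self, user_id):
        # All items the user has acted on.
        return {i for u, i, _ in self.actions if u == user_id}

    def users_of(self, item_id):
        # All users who have acted on the item.
        return {u for u, i, _ in self.actions if i == item_id}


data = UserData()
data.users["amber"] = {"gender": "female", "occupation": "pianist"}
data.users["elena"] = {"gender": "female", "occupation": "professor"}
data.items["Titanic"] = {"genres": {"Romance", "Drama"}}
data.items["The Godfather"] = {"genres": {"Crime", "Drama"}}
data.add_action("amber", "Titanic", action="like", date="2021-05-17")
data.add_action("elena", "The Godfather", action="like")

print(data.items_of("amber"))
print(data.users_of("The Godfather"))
```

Keeping actions as a separate edge list, rather than nesting them under users, makes it easy to attach temporal or location attributes to each action.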

• Users are not independent entities and they are connected through social links.

• Social links can be explicit (friendship on Facebook, following on Twitter, co-authorship), or implicit (like-minded users).
Links between users
• Explicit link. Mary is a female engineer and John is a male student. Mary and John are explicitly linked through their friendship on Facebook.
• Implicit link. Elena is a female professor who likes The Godfather (Crime, Drama). Amber is a female pianist who likes Titanic (Romance, Drama). Elena and Amber are implicitly linked through their interest in drama-genre movies.

• The simple bipartite structure of user data contains many pieces of useful information.
Simple data structure but rich value
• Amber is a female pianist. Amber likes Titanic.
• Item attributes. Titanic was produced in 1997 by James Cameron, starring Leonardo DiCaprio and Kate Winslet.
• Action attributes. Amber liked the movie Titanic on 17 May 2021, at 3365 Indiana Street, San Diego, USA.
• User groups. Amber belongs to the group of female pianists in California with 34K members.
• Abstract user groups. Amber also belongs to the group of females, the group of pianists, the group of Californians, and the group of Titanic lovers.
• Abstract user attributes. Amber is also an artist.

• User data preparation is the process of preparing (raw) user data for UDA.

• The outcome of user data preparation is another version of user data with less noise.
User data preparation
Figure: the steps of User Data Preparation. Extract, Transform, Load (ETL), which covers User Data Ingestion, User Data Integration, and User Data Cleaning, followed by User Data Post-processing (Augmentation, Delivery).

• The first step in user data preparation is called ETL.

• Extraction of user data from a source is the first phase of ETL. The literature often considers the “ingestion” and “integration” steps also inside this first part of ETL.

• Transform is a mediator phase to apply a set of rules and pre-defined functions to prepare the data to load. The literature often considers “data cleaning” also as a component of this ETL part.

• Load is the last phase to place the data in the hosting structure, such as a relational or NoSQL database.
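The three ETL phases can be sketched with the Python standard library alone. This is a minimal illustration, not a production pipeline: the raw CSV content, the column names, and the `users` table are hypothetical, and an in-memory SQLite database stands in for the hosting structure.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with a missing gender and an unusable age.
RAW_CSV = """user,gender,age
u1,F,34
u2,,27
u3,M,not available
"""


def extract(text):
    # Extract: read the raw records from the source.
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    # Transform: apply simple rules so that records fit the target schema.
    cleaned = []
    for row in rows:
        age = int(row["age"]) if row["age"].isdigit() else None
        gender = row["gender"] or None
        cleaned.append((row["user"], gender, age))
    return cleaned


def load(records, connection):
    # Load: place the records in the hosting structure (a relational table).
    connection.execute(
        "CREATE TABLE users (user TEXT PRIMARY KEY, gender TEXT, age INTEGER)"
    )
    connection.executemany("INSERT INTO users VALUES (?, ?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
rows = connection.execute(
    "SELECT user, gender, age FROM users ORDER BY user"
).fetchall()
print(rows)
```

Unusable values are stored as NULL here, so that the later cleaning step can decide between dropping and imputation.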

Where to obtain (public) user data?

• Collect user data using Amazon Mechanical Turk, Survey Monkey, and other similar platforms.

• Crawl user data using BeautifulSoup and other similar libraries. The process is also called web scraping.

• Download the data from dataset repositories, e.g., UCI, Kaggle, Github, Google Dataset Search, Harvard
Dataverse, etc.
Extract, Transform, Load (ETL)

• We crawl data if no direct and easy access is available to the data in question.

• Before crawling, we always have to check copyright issues. Also note that some websites offer their own APIs.

• Webpages with some regularities are the best candidates for crawling.

• Beautiful Soup is a Python library for pulling data out of HTML (https://www.crummy.com/software/BeautifulSoup/bs4/doc/).
Data acquisition using crawling
from urllib.request import urlopen  # Python 3 replacement for urllib2

from bs4 import BeautifulSoup

# DBLP page listing the SIGMOD 2020 papers
url_template = "https://dblp.org/db/conf/sigmod/sigmod2020.html"
keywords = ["user data"]

page = urlopen(url_template)
soup = BeautifulSoup(page, "html.parser")

# Paper titles are stored in <span class="title"> elements
papers = soup.find_all("span", {"class": "title"})

for paper in papers:
    paper_str = paper.text
    for keyword in keywords:
        # Print each matching title only once
        if paper_str.find(keyword) != -1:
            print(paper_str)
            break

• Task. Write Python code that automatically finds all research papers (and their authors) about a given set of keywords 𝒲, where 𝒲 is an input parameter.

• Download the Python code paper-finder.py in the following link, and complete it: https://drive.google.com/drive/folders/1M-HlNao9tYwqN0imeZ-SzHnGZKMoJgh4?usp=sharing.

• Missing parts are marked with a TODO comment.
Hands-on 1: Research paper finder
The DM Authors dataset is built in the same way. It is available in the PerSCiDO platform via https://doi.org/10.18709/perscido.2016.10.ds32
[Omidvar-Tehrani, Amer-Yahia, Termier @ CIKM’15]

• Nowadays most web pages are highly dynamic, and such dynamic content is harder to collect.

• ScrapingBee is a tool for headless web browsing. It emulates human behavior so that websites don’t block the crawling process.

• Selenium is an open-source project for browser automation. The following code crawls a webpage protected by a login.
Advanced data collection
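from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# DRIVER_PATH, USERNAME, and PASSWORD are placeholders to be defined by the reader.
options = Options()
options.add_argument("--headless")  # run Chrome without a visible window
driver = webdriver.Chrome(service=Service(DRIVER_PATH), options=options)

driver.get("https://news.ycombinator.com/login")
print(driver.page_source)

# Fill in the login form and submit it (Selenium 4 locator syntax)
driver.find_element(By.XPATH, "//input").send_keys(USERNAME)
driver.find_element(By.XPATH, "//input[@type='password']").send_keys(PASSWORD)
driver.find_element(By.XPATH, "//input[@value='login']").click()

driver.quit()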

• Data cleaning refers to the process of detecting and removing noise in data.

• The cleanliness of data can be evaluated using different measures such as validity, accuracy, completeness, consistency, and uniformity.

• User data cleaning techniques:

• Dealing with missing values

• Dealing with outliers

• Data improvement

• Data tidy-up

• Scaling
User data cleaning

• Missing values are considered as noise.

• In user datasets, many attribute values are missing (e.g., gender, occupation, visitation date).

• When data is missing, we either drop it or impute it.

• Dropping is often performed using a threshold on the rate of missing values.

• Imputation preserves the data size, hence it is often preferable to dropping.

• Numerical imputation. Replace the missing data with a default value, for instance 0 instead of None. The median is another value to consider (why not the average?)

• Categorical imputation. Replace the missing values with the most frequent value in the column; otherwise use “other”.
User data cleaning techniques: missing values
# data is a pandas DataFrame
threshold = 0.7

# Drop columns whose missing value rate is higher than the threshold
data = data[data.columns[data.isnull().mean() < threshold]]

# Drop rows whose missing value rate is higher than the threshold
data = data.loc[data.isnull().mean(axis=1) < threshold]
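The two imputation strategies can be sketched with pandas as follows. The toy DataFrame and its column names are assumptions made for illustration. The example also hints at the answer to "why not the average?": the single large age value would pull the mean upwards, while the median stays close to the typical user.

```python
import pandas as pd

# Hypothetical user dataset with one missing value per column.
data = pd.DataFrame({
    "age": [25.0, None, 31.0, 90.0],
    "occupation": ["student", "engineer", None, "student"],
})

# Numerical imputation: the median is less sensitive to extreme values than the average.
data["age"] = data["age"].fillna(data["age"].median())

# Categorical imputation: use the most frequent value, or "other" if there is none.
modes = data["occupation"].mode()
fill_value = modes.iloc[0] if not modes.empty else "other"
data["occupation"] = data["occupation"].fillna(fill_value)

print(data)
```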

• Outliers are considered as potential noise.

• An outlier is a data point that deviates markedly from the rest of the data.

• Methods for outlier detection are visualization (the most effective method), standard deviation, and percentiles.

• If the distance of a value to the average is higher than X times the standard deviation, the value can be assumed to be an outlier.

• A certain percentage of the values at the top or the bottom can be considered as outliers.

• Outlier values can be either dropped or capped.

• Akin to missing data techniques, the former doesn’t maintain the data size, while the latter does.
User data cleaning techniques: outliers
Is Brazil an outlier? What about Burundi?
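The standard deviation rule, dropping, and percentile-based capping can be sketched with pandas. The score values, the multiplier `X = 2`, and the 5th/95th percentile bounds are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical scores with one suspicious value.
scores = pd.Series([10, 12, 11, 13, 12, 11, 10, 13, 12, 95], dtype=float)

# Standard deviation rule: flag values farther than X standard deviations from the average.
X = 2  # hypothetical choice
is_outlier = (scores - scores.mean()).abs() > X * scores.std()

# Dropping removes the flagged values, so the data size shrinks.
dropped = scores[~is_outlier]

# Capping keeps the data size: values beyond the chosen percentiles are clipped to them.
lower, upper = scores.quantile(0.05), scores.quantile(0.95)
capped = scores.clip(lower=lower, upper=upper)

print(len(scores), len(dropped), len(capped))
```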

• Data cleaning is not always about reducing noise, but also increasing the utility of user data.

• Examples of data improvement techniques are binning and log transform.
User data cleaning techniques: data improvement
Figures: percentage binning; log transform.
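Binning and log transform can be sketched with pandas and NumPy. The column names, the bin edges, and the bin labels below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical user features: age, and a heavily skewed activity count.
data = pd.DataFrame({"age": [15, 22, 37, 64], "nb_ratings": [0, 9, 99, 9999]})

# Binning: replace a numerical value with the interval it falls in.
data["age_group"] = pd.cut(
    data["age"],
    bins=[0, 18, 30, 45, 120],
    labels=["under 18", "18-29", "30-44", "45+"],
    right=False,  # intervals are closed on the left: [0, 18), [18, 30), ...
)

# Log transform: compress a skewed range; log1p(x) = log(1 + x) also handles zeros.
data["log_nb_ratings"] = np.log1p(data["nb_ratings"])

print(data)
```

Binning trades precision for robustness against small measurement errors, while the log transform reduces the influence of very active users on distance-based methods.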

• A user dataset is called tidy if and only if every row represents a user and every column represents a feature.

• Tidy datasets are easy to manipulate, model, and visualize.

• Grouping is the process of making un-tidy data tidy. Common grouping operations are average, sum, and concatenation.

• Is ungrouping (tidy to untidy) necessary too?
User data cleaning techniques: data tidy-up

Transaction user dataset (un-tidy)
user    score
u1      65
u2      14
u1      32
u3      60
u2      30
u1      90

Tidy user dataset (after grouping)
user    average score
u1      62.33
u2      22
u3      60
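The grouping step of this example can be reproduced with a pandas `groupby`, here with the average as the grouping operation. The variable names are illustrative.

```python
import pandas as pd

# The transaction (un-tidy) dataset of the example: several rows per user.
transactions = pd.DataFrame({
    "user": ["u1", "u2", "u1", "u3", "u2", "u1"],
    "score": [65, 14, 32, 60, 30, 90],
})

# Grouping: one row per user, with the average of the user's scores.
tidy = (
    transactions.groupby("user", as_index=False)["score"]
    .mean()
    .rename(columns={"score": "average score"})
)

print(tidy.round(2))
```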

Data cleaning frameworks
by Michael Stonebraker (ACM Turing Award winner) focusing on data mastering and unification.
Apple inductiv by Christopher Ré, Ihab Ilyas, and Theodoros Rekatsinas focusing on employing artificial intelligence to automate the task of identifying and correcting errors in data.
by the same leaders of inductiv focusing on providing a Machine Learning system for data repair and predictions on structured data.
OpenCloud by NYU Data Science focusing on providing a Python library for data preprocessing and cleaning.
by Laure Berti-Equille focusing on providing a Python library for data preprocessing and cleaning based on Q-Learning.

• Question. You are the head of a data engineering team in a healthcare company. Their user data is entered manually by nurses and hence is noisy, which means it includes many missing and possibly inaccurate values in patient information. How do you prioritize between the data cleaning techniques?
Poll: Prioritizing data cleaning techniques
Figure: bar chart of votes per data cleaning technique (feature split, dropping, grouping, scaling, imputation, binning, log transform).

User data visualization
• Sensemaking of user data using visual variables.

• A visualization component consists of three building blocks: views, visual variables, and visual elements.

• Visualization can be done either at the beginning or at the end of UDA pipelines, for hypothesis testing and validation, respectively.

• At the core of visualizing user data is a mapping function that associates user characteristics with visual variables.

• Figure: a visualization of the MovieLens dataset, annotated with its view, visual variables, and visual elements.
[Zegarra et al., FGCS’20]
[Heer and Hellerstein, VLDB’09]
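The idea of a mapping function from user characteristics to visual variables can be sketched as a plain Python function. Everything in this sketch is an assumption made for illustration (the attribute names, the color palette, and the choice of a scatter-plot mark); it does not reproduce the mapping of any cited system.

```python
# Hypothetical palette: the hue of a mark encodes the gender of a user.
GENDER_COLORS = {"female": "orange", "male": "blue"}


def to_visual_element(user):
    """Map one user record to the visual variables of a scatter-plot mark."""
    return {
        "x": user["age"],          # horizontal position encodes age
        "y": user["nb_ratings"],   # vertical position encodes activity
        "color": GENDER_COLORS.get(user["gender"], "gray"),
        "shape": "circle",
    }


users = [
    {"gender": "female", "age": 34, "nb_ratings": 120},
    {"gender": "male", "age": 21, "nb_ratings": 15},
]
marks = [to_visual_element(user) for user in users]
print(marks)
```

A view then renders the list of marks, so changing the mapping function changes the visualization without touching the data.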

• User data can be visualized with typical visualization tools such as Tableau, or with more specialized approaches
such as graph-based or location/time-based visualization.
Types of visualization
Categories: off-the-shelf visualization, graph-based visualization, geospatial and temporal visualization, application-dependent visualization.
• NodeTrix [Henry et al., TVCG’07]. User groups are shown using node-link diagrams and adjacency matrices.
• Bike angels [Chung et al., COMPASS’18]. Dispatchers are informed which stations should have bikes added (in blue) or removed (in red).
• Soccer analytics [Machado et al., CG’17]. Players are users and their actions are visualized to obtain insights.
• Visualization grammars [Satyanarayan et al., TVCG’17]. Visual grammars facilitate creating, saving, and sharing visual analytics.

• D3.js is a JavaScript library for web-based visualization. (Why web-based?)

• D3 stands for Data-Driven Documents.

• The starting point is often the visualization zoo at https://d3js.org.
Web-based visualization
Developed by Jeffrey Heer at the University of Washington
<div id="scatter_area"></div>

<script src="https://d3js.org/d3.v4.js"></script>

<script>

var margin = …

var svg = d3.select("#scatter_area") …

var data = [ {x:10, y:20}, {x:40, y:90}, {x:80, y:50} ]

var x = d3.scaleLinear() …

var y = d3.scaleLinear() …

svg.selectAll("whatever").data(data).enter() …

</script>
[Bostock et al., TVCG’11]

var x = d3.scaleLinear()
    .domain([0, 100])
    .range([0, width]);

svg.append('g')
    .attr("transform", "translate(0," + height + ")")
    .call(d3.axisBottom(x));
[Bostock et al., TVCG’11]

Hands-on 2: D3 histogram
• Task. We are given a CSV file including hours that users logged in to a platform under investigation. Visualize a histogram for this data using D3.

• Download the content in the sub-folder D3-Histogram in the following link, and complete it: https://drive.google.com/drive/folders/1f82RplHgLte223QoD99UIKEM3IJSV4y5?usp=sharing.

• Important. You need a local web server to run this example. You can simply use:

$ python -m SimpleHTTPServer 8000   # Python 2

$ python3 -m http.server 8000   # Python 3

The resulting histogram shows that the peak hours were around 11AM and 5PM. It also shows that no log-in was done early in the morning.

• Crossfilter is a JavaScript library focusing on fast multidimensional filtering for coordinated views.

• In other words, Crossfilter brings interactivity to visualizations.

• Source files are accessible via https://github.com/crossfilter/crossfilter. See examples in https://drarmstr.github.io/chartcollection/examples/#worldbank.
Crossfilter
[Omidvar-Tehrani et al., ICDE’17]

• Various approaches have been proposed for the visualization of time-based activities of users, in an interactive
manner.

• EventFlow is an example of leveraging the time dimension, where groups of users are shown along their temporal actions in a visual interface. (https://hcil.umd.edu/eventflow/)
Time-based visualization
[Monroe et al., TVCG’13]
Figure annotations: groups of patients with common treatments; length of treatments.

• Behavioral analysis extracts value from user data.

• User data is modeled as a bipartite graph with users on one side and items on the other, linked by actions.

• The user data analysis pipeline contains user data preparation, mining and recommendation, presentation, and exploration.

• We often obtain user data by collecting, crawling (scraping), or downloading from dataset repositories.

• The main tasks in user data cleaning deal with missing values, outliers, data improvement, data tidy-up, and data scaling.

• At the core of visualizing user data is a mapping function that associates user characteristics with visual variables.

• Visualization of user evolution needs special care.
Takeaways from the first session

User Data Mining and Recommendation
SESSION 2

• One important task in UDA is to understand user behavior.

• Simply put, we’re interested to know “what users have done” by collecting their interactions with data.

• Understanding user behavior benefits businesses, as it helps them envision what services to expand in the future to increase both user satisfaction and revenue.
Understanding user behavior
Amazon product recommendation

© measuringu.com
Netflix movie recommendation

© Medium

• We employ user data for two separate tasks: mining and recommendation.

• Mining

• To understand and represent user behaviors in the captured data.

• A famous application in industry is cross-selling: “customers who bought this item also bought …”.

• The fundamental assumption is that there exist groups of user activities formed by like-minded users, which constitute different instances of user behavior. Hence the main action is grouping.

• Recommendation

• To predict future user behaviors from the captured data. Recommendation is a great approach for personalization.

• The fundamental assumption is that there exists a latent relation in user interactions, which can also predict possible future interactions. Hence the main action is relation discovery.
User data mining and recommendation
http://cliintel.com/diapers-beer-and-data-in-retail/

User data mining
• The main action in user data mining is grouping, which is usually performed in an unsupervised context.

• We need two elements to group users: a distance function and a representation approach.

• The distance function imposes the grouping / mining semantics. It determines whether two users should or should not be placed in a common group. It is sometimes called a similarity function.

• The representation approach defines how each mined group should be labeled. In the following example, majority voting is used for representation.

Example: Mia likes 60 drama movies and 40 action movies. What is her distance to the group of drama-genre lovers, and what is her distance to the group of action-genre lovers?
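The example of Mia can be made concrete with a small sketch. The Euclidean distance over genre profiles is only one possible choice of distance function, picked here as an assumption, and the function names are illustrative. Majority voting labels Mia with her most frequent genre.

```python
import math
from collections import Counter


def genre_profile(liked_genres):
    # Fraction of liked movies per genre.
    counts = Counter(liked_genres)
    total = sum(counts.values())
    return {genre: n / total for genre, n in counts.items()}


def distance(p, q):
    # Euclidean distance between two genre profiles (one possible choice).
    genres = set(p) | set(q)
    return math.sqrt(sum((p.get(g, 0.0) - q.get(g, 0.0)) ** 2 for g in genres))


def majority_label(liked_genres):
    # Majority voting: the most frequent genre is used as the label.
    return Counter(liked_genres).most_common(1)[0][0]


mia = ["drama"] * 60 + ["action"] * 40
drama_lovers = {"drama": 1.0}    # idealized profile of the drama group
action_lovers = {"action": 1.0}  # idealized profile of the action group

d_drama = distance(genre_profile(mia), drama_lovers)
d_action = distance(genre_profile(mia), action_lovers)
print(round(d_drama, 3), round(d_action, 3), majority_label(mia))
```

With a hard grouping, Mia ends up only in the drama group even though 40% of her activity is about action movies, which motivates soft grouping methods.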

Myriads of grouping methods
• Community and Clique Detection [Newman, Physical J.’04] [Barbieri et al., ICDM’13] [Goyal et al., CIKM’08]
• Team and Tribe Formation [Nikolaev et al., KDD’16]
• Segment Discovery [Amer-Yahia et al., WWW’2017]
• Pattern and Cube Mining [Xin et al., KDD’06] [Kamat et al., ICDE’14]
• Clustering and Partitioning [Agrawal et al., ACM’1998] [Pedreira et al., VLDB’16]
• Cohort Representation [Jiang et al., VLDB’16] [Omidvar-Tehrani, Amer-Yahia, Lakshmanan @ DSAA’18]

• One common mining approach is clustering.

• K-means is the de facto standard in clustering. It is an iterative approach in the style of expectation-maximization (EM) that updates the parameters of each cluster until convergence.

• Cluster centroids are representatives.

• K-means is a hard clustering method.

• K-means clusters are radial.
Clustering: k-means
input parameter k
centroids ← k random users
repeat until convergence:
    for all users:
        find the centroid closest to the user
        assign the user to the cluster of that centroid (expectation)
    for all clusters:
        update the centroid to the mean of its users (maximization)
return k centroids
Figure (© iChrome): k-means results for k=2, k=3, k=4, and k=5.
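The k-means pseudocode can be turned into a short NumPy sketch. This is a minimal illustration under assumptions: users are rows of a numeric array, the distance is Euclidean, and the parameters `max_iterations` and `seed` are additions for reproducibility that are not part of the pseudocode.

```python
import numpy as np


def k_means(points, k, max_iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # centroids <- k random users
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iterations):
        # Assignment step: each user joins the cluster of the closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its users.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # convergence: the centroids no longer move
        centroids = new_centroids
    return centroids, labels


# Hypothetical users described by two numerical features.
points = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
    [8.0, 8.0], [9.0, 8.5], [8.5, 9.0],
])
centroids, labels = k_means(points, k=2)
print(labels)
```

Each user receives exactly one label, which is what makes k-means a hard clustering method.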

• Hard clustering bears no uncertainty.

• In real user data, users often belong to more than one group.

• A generalized non-hard clustering approach is Gaussian Mixture Models (GMM).

• The idea is to represent each cluster with a Gaussian distribution in lieu of a centroid. Hence the whole model contains k different distributions.

• The objective is to maximize the fit between the data points in each cluster and its representative distribution, using maximum likelihood estimation.
Clustering: Gaussian Mixture Models (GMM)
© Oscar Contreras Carrasco @ towards data science
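Soft clustering with a GMM can be sketched with scikit-learn's `GaussianMixture`. The synthetic two-cluster data and the two query points are assumptions for illustration. Instead of a single label, each user receives one membership probability per cluster.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical users drawn around two centers in a two-dimensional feature space.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2)),
    rng.normal(loc=[6.0, 6.0], scale=1.0, size=(100, 2)),
])

# One Gaussian distribution per cluster, fitted by maximum likelihood (EM).
model = GaussianMixture(n_components=2, random_state=0).fit(data)

# Soft assignment: one membership probability per cluster for each user.
memberships = model.predict_proba([[3.0, 3.0], [0.0, 0.0]])
print(memberships.round(3))
```

A user located between the two centers gets non-trivial probabilities for both clusters, while a user at one center belongs almost entirely to a single cluster.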

• Density-based spatial clustering of applications with noise (DB-Scan) is a grouping method based on both a distance and a minimum number of points. The combination of the two parameters creates a notion of neighborhood. The resulting clusters are not necessarily radial.
Clustering: DB-Scan
© KDnuggets
input parameters d and nbu // d = distance, nbu = minimum number of users
find the users in the d-neighborhood of every user, and identify core users with more than nbu neighbors.
find the connected components of core users on the neighbor graph, ignoring all non-core users.
assign each non-core user to a nearby cluster if the cluster is a d-neighbor, otherwise mark it as noise.
return clusters

• Clustering algorithms can also be used off the shelf through high-level Python libraries.

• Among many successful libraries, scikit-learn is a popular and standard one.

• For k-means, given the data and the number of clusters, the library does the rest.

• For DB-Scan, given the data, the distance and the minimum number of users, the library does the rest.
Python libraries for clustering
# k-means
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[1, 2], [1, 4], …])
clusters = KMeans(n_clusters=2).fit(data)
print(clusters.labels_)
# [1, 1, 1, 0, 0, …]
print(clusters.predict([[12, 3]]))  # predict expects a 2D array of points
# [0]

# DB-Scan
import numpy as np
from sklearn.cluster import DBSCAN

data = np.array([[1, 2], [2, 2], …])
clusters = DBSCAN(eps=3, min_samples=2).fit(data)
print(clusters.labels_)
# [ 0, 0, 0, 1, 1, -1, …]
# DBSCAN has no predict(); fit_predict() re-clusters the data it is given.
predictions = clusters.fit_predict(new_data)

• From a human perspective, the representativity of all previous grouping approaches is weak.

• As explainability matters in AI (XAI trend), it is desirable to have a soft non-radial grouping method which represents groups in a human-understandable form, e.g., “group of students who participate in RAIS summer school.”

• Frequent Itemset Mining (FIM) is often considered as a method for market basket analysis.

• The initial goal is to find sets of products that are frequently bought together.

• Each frequent itemset is a describable group.
Frequent Itemset Mining
© S. Harris @ ScienceCartoonsPlus.com
FIM: Definitions
• We are given a set of items $\mathcal{I}$, where any subset of $\mathcal{I}$ is an itemset.

• We are also given a transaction (un-tidy) dataset $\mathcal{T}$, where each member of $\mathcal{T}$ is an itemset.

• Given an itemset $X \subseteq \mathcal{I}$, $support(X)$ is the number of transactions containing $X$.

• An itemset $X \subseteq \mathcal{I}$ is a frequent itemset if $support(X) \geq \delta$, where $\delta$ is the minimum support threshold.

• Given two itemsets $X \subseteq \mathcal{I}$ and $Y \subseteq \mathcal{I}$, an association rule $X \rightarrow Y$ with confidence $c$ holds if $c \geq \delta'$ ($\delta'$ is the
minimum confidence threshold), where $c = \frac{support(X \cup Y)}{support(X)}$.
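
As a small illustration of these definitions, the sketch below computes support and confidence in plain Python. The transactions are the movies watched by users u1 to u6 as listed in the tidy table of the “full-fledged” slide of this course; the function names `support` and `confidence` are chosen here for illustration only.

```python
# transaction dataset: one itemset (set of watched movies) per user
transactions = {
    "u1": {"Terminal", "Forrest Gump", "Pianist", "Psycho", "Unhinged"},
    "u2": {"Terminal", "Forrest Gump", "Pianist", "Unhinged"},
    "u3": {"Pianist"},
    "u4": {"Forrest Gump", "Pianist"},
    "u5": {"Terminal", "Forrest Gump", "Pianist", "Psycho"},
    "u6": {"Terminal", "Forrest Gump", "Pianist"},
}

def support(itemset):
    """Absolute support: number of transactions that contain the whole itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def confidence(x, y):
    """Confidence of the rule x -> y: support(x ∪ y) / support(x)."""
    return support(x | y) / support(x)

s = support({"Terminal", "Forrest Gump", "Pianist"})
relative_s = s / len(transactions)
c1 = confidence({"Forrest Gump", "Pianist"}, {"Terminal"})
c2 = confidence({"Pianist"}, {"Terminal", "Forrest Gump"})
print(s, round(relative_s, 2), round(c1, 2), round(c2, 2))
```

An itemset is then frequent when `support(itemset) >= delta`, and a rule holds when its confidence reaches the minimum confidence threshold.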

FIM: Example
Figure: transaction user dataset listing the movies watched by each of the users u1 to u6.

{The Terminal, Forrest Gump, The Pianist} is a frequent itemset.
absolute support = 4
relative support = 4/6 ≈ 67%

{Forrest Gump, The Pianist} → {The Terminal} is an association rule.
confidence = 4/5 = 80%

{The Pianist} → {The Terminal, Forrest Gump} is another association rule.
confidence = 4/6 ≈ 67%

FIM: Computation
• Apriori algorithm. It is a level-wise search (first 1-itemsets, then 2-itemsets, …) which exploits the following
pruning opportunity: if an itemset is not frequent, then none of its supersets is frequent.

• For instance, if {Psycho, Unhinged} is not frequent, then {Psycho, Unhinged, The Pianist} cannot be
frequent either.

• For instance, given a minimum support threshold of 3, the itemset {young, CA, student} (2 records) is not frequent,
and neither is its superset {male, young, CA, student}.
[Agrawal et al., SIGMOD’93]
Figure: lattice of user groups, each annotated with its number of records.
group                         #records
{}                            3662
{male}                        2634
{young}                       2147
{CA}                          664
{student}                     184
{male, young}                 1588
{male, CA}                    477
{male, student}               120
{young, CA}                   375
{young, student}              13
{CA, student}                 20
{male, young, CA}             268
{male, young, student}        13
{male, CA, student}           17
{young, CA, student}          2
{male, young, CA, student}    2
[Omidvar-Tehrani, Amer-Yahia, Dutot, Trystram @ PKDD’16]
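
The level-wise search and its pruning rule can be sketched as below. This is a compact teaching version under stated assumptions, not the original Apriori code: the function name `apriori` is illustrative, supports are counted by a full scan at every level, and the transactions reuse the six users of the course's movie example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {item for t in transactions for item in t}
    # level 1: frequent 1-itemsets
    level = {c: s for c, s in count({frozenset([i]) for i in items}).items()
             if s >= min_support}
    frequent = dict(level)
    size = 2
    while level:
        # join: unions of two frequent (size-1)-itemsets that contain exactly `size` items
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # prune: a candidate survives only if all its (size-1)-subsets are frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, size - 1))}
        level = {c: s for c, s in count(candidates).items() if s >= min_support}
        frequent.update(level)
        size += 1
    return frequent

transactions = [
    {"Terminal", "Forrest Gump", "Pianist", "Psycho", "Unhinged"},
    {"Terminal", "Forrest Gump", "Pianist", "Unhinged"},
    {"Pianist"},
    {"Forrest Gump", "Pianist"},
    {"Terminal", "Forrest Gump", "Pianist", "Psycho"},
    {"Terminal", "Forrest Gump", "Pianist"},
]
result = apriori(transactions, min_support=4)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Because {Psycho} and {Unhinged} are already infrequent at level 1, no candidate containing them is ever generated, which is exactly the pruning opportunity described above.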

FIM for mining describable groups of users
• We employ an efficient frequent itemset mining algorithm called LCM for
mining groups in user data.

• Step 1. Identifiers for both users and items should be mapped to a
non-negative integer space (required by LCM). For instance, if the
movie Titanic (as an item) is mapped to “25” and the user “John” is
mapped to “120”, the tuple <120,25> means that John has
watched the movie Titanic.

• Step 2. We transform a tidy dataset to an un-tidy (transactional)
dataset, where each line represents one user and all item IDs
associated with the user are listed in that line, separated by spaces.

• Step 3. Run LCM to mine groups.

• Each line in the output file returned by LCM represents one group.
[Uno et al., Discovery Science ’04]
http://research.nii.ac.jp/~uno/code/lcm.html
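
Steps 1 and 2 can be sketched in a few lines of Python. The snippet is only an illustration under assumptions: the tiny tidy dataset and the helper name `encode` are hypothetical, and codes are assigned in order of first appearance rather than with the specific values (“25”, “120”) used in the example above.

```python
from collections import defaultdict

# hypothetical tidy dataset: one (user, item) pair per row
tidy = [("John", "Titanic"), ("John", "Psycho"), ("Mary", "Titanic")]

def encode(values):
    """Step 1: map each distinct value to a non-negative integer."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return codes

user_ids = encode(u for u, _ in tidy)
item_ids = encode(i for _, i in tidy)

# Step 2: one line per user, listing the codes of all items of that user
baskets = defaultdict(list)
for user, item in tidy:
    baskets[user_ids[user]].append(item_ids[item])

lines = [" ".join(str(i) for i in sorted(baskets[u])) for u in sorted(baskets)]
print("\n".join(lines))
```

Writing `lines` to a text file gives a transactional file in the one-user-per-line layout described in Step 2.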

Full-fledged behaviors in user data mining
• With the approach discussed in the previous slides, we obtain groups based solely on the co-occurrence of items.

• It is more desirable to mix demographics and items to obtain groups such as “middle-aged females in Grenoble
who watched The Terminal and Forrest Gump.”

• It is possible to encode user attributes in the same transactional database. Then LCM will give us full-fledged
groups.

User data (tidy)
user  gender  age     movies watched
u1    F       Young   Terminal, Forrest., Pianist, Psycho, Unhinged
u2    F       Middle  Terminal, Forrest., Pianist, Unhinged
u3    M       Middle  Pianist
u4    F       Young   Forrest., Pianist
u5    F       Middle  Terminal, Forrest., Pianist, Psycho
u6    M       Middle  Terminal, Forrest., Pianist

Movie codes
movie     code
Terminal  1
Forrest.  2
Pianist   3
Psycho    4
Unhinged  5

Attribute value codes
attribute value  code
Female           101
Male             102
Young            103
Middle           104

Transactions (un-tidy)
line #  transaction
1       1 2 3 4 5 101 103
2       1 2 3 5 101 104
3       3 102 104
4       2 3 101 103
5       1 2 3 4 101 104
6       1 2 3 102 104

LCM output: [1 2 3 101 104] (2) [2 5]
Translated: [Terminal Forrest. Pianist Female Middle] (2) [u2 u5]
Hands-on 3: Mining user groups
• Step 1. Find the MovieLens 1M dataset on a dataset repository and download it. The dataset contains movies that
users appreciated watching. We only need the file ratings.dat.

• Step 2. Download the Python file pmr.py from the following link and complete it: https://drive.google.com/drive/folders/1xMxGdcI2IGgTAhozDUqSfZAWzKVXfkjr?usp=sharing.

• Step 3. Run the code to obtain the output file pmr.txt.

• Step 4. Download the LCM software from the following link: https://drive.google.com/drive/folders/1xMxGdcI2IGgTAhozDUqSfZAWzKVXfkjr?usp=sharing.

• Step 5. Put the dataset file in the same folder as LCM.

• Step 6. Run LCM as follows:
./lcm CfI -l 5 -u 100 pmr.txt 3 out.txt

• Step 7. Open the output file out.txt. Each line in the file out.txt represents a group in the following structure: [set of
items] (support) [set of users]. The description of the group is [set of items]. The set of group members is [set of users].

• Step 8. Try to find 5 interesting user groups.

Poll: Challenge of mining user groups
• Question. Following the steps in the previous hands-on, what is the most challenging aspect of mining
groups which remains unsolved?
Chart: Challenges of user data mining (votes per option). Options: Efficiency, Overlap, Size of clusters, Explainability, Mechanism, Binning.

User data mining for advanced decision making
53
• Both clustering and frequent itemset mining are based on the idea of density maximization.

• But is density what the end-user really desires to achieve?

• Oftentimes, more quality measures are required, such as coverage, diversity, and variance.
© prototypr.io

Multi-objective optimization
• This makes a multi-objective optimization problem.

• Given a set of ratings $R$, identify all group-sets $G$ where each group-set satisfies:

• $coverage(G, R)$ is maximized, ensuring that most input records belong to at least one group in the output;

• $diversity(G, R)$ is maximized, ensuring that found groups are as different as possible from each other;

• $diameter(G, R)$ is minimized, ensuring that ratings within each group are homogeneous.

• The problem is proved to be NP-Complete by a reduction from the Exact 3-Set Cover problem (EC3).
[Omidvar-Tehrani, Amer-Yahia, Dutot, Trystram @ PKDD’16]

Diameter objective
• Diameter is a simple but effective measure of variance in ratings.

• Below, we observe that most reviewers agree on a high score for the movie The Godfather → minimum diameter.

• We also observe that the reviewers are divided when voting on Fifty Shades of Grey → maximum diameter.
Figure: rating distributions in IMDb (x-axis: rating scores from 1 to 10; y-axis: count in %). The Godfather (1972) has a homogeneous rating distribution (minimum diameter); Fifty Shades of Grey (2015) has a polarized rating distribution (maximum diameter). Other rating distributions exist, e.g., increasing, decreasing, heterogeneous.

Pareto group discovery
• A bottom-up exhaustive approach to discover the Pareto front.

• Generating fewer plans makes a multi-objective optimization algorithm run faster.
Figure: user groups as Pareto fronts, plotted by coverage and diversity (both from 0 to 1). Legend: candidate group-set, dominance area, rejected group-set, Pareto group-set, α-dominance area, rejected group-set in case of α-dominance. (Figure from “Optimization-based User Group Management: Discovery, Analysis, Recommendation”, November 6, 2015.)

An approximation algorithm for Pareto group discovery
1. Inputs are $k$, $\alpha > 1$, $R$

2. Output is the Pareto result set $\mathcal{P}$

3. $\mathcal{P} \leftarrow \emptyset$

4. For all user groups $g$ do

   1. $G$ ← Singleton group-set containing $g$

   2. If $G$ is not $\alpha$-dominated by any other group-set $\in \mathcal{P}$, then add $G$ to $\mathcal{P}$

5. For $n \in [2,k]$ do

   1. For each possible group-set $G$ of size $n$ do

      1. If $G$ is not $\alpha$-dominated by any other group-set $\in \mathcal{P}$, then add $G$ to $\mathcal{P}$

6. Return $\mathcal{P}$
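
The loop structure of this algorithm can be sketched as follows. Several parts are assumptions made for illustration and are not taken from the slides: α-dominance is taken to mean "within a factor α or better on every objective", only two maximized objectives (coverage and diversity) are used, the toy `objectives` function is a stand-in for the real measures, and the three groups are hypothetical.

```python
from itertools import combinations

def alpha_dominates(a, b, alpha):
    """Assumed definition: a alpha-dominates b if alpha * a >= b on every (maximized) objective."""
    return all(alpha * x >= y for x, y in zip(a, b))

def approximate_pareto(groups, k, alpha, objectives):
    pareto = []  # list of (group_set, objective_vector)
    # sizes 1..k: singletons first (step 4), then larger group-sets (step 5)
    for n in range(1, k + 1):
        for group_set in combinations(groups, n):
            vec = objectives(group_set)
            if not any(alpha_dominates(v, vec, alpha) for _, v in pareto):
                pareto.append((group_set, vec))
    return pareto

# hypothetical groups over six users
groups = {"g1": {1, 2, 3}, "g2": {3, 4}, "g3": {5, 6}}
N_USERS = 6

def objectives(group_set):
    members = [groups[g] for g in group_set]
    union = set().union(*members)
    coverage = len(union) / N_USERS                        # share of users covered
    diversity = len(union) / sum(len(m) for m in members)  # 1.0 when groups are disjoint
    return (coverage, diversity)

result = approximate_pareto(list(groups), k=2, alpha=1.1, objectives=objectives)
for group_set, vec in result:
    print(group_set, tuple(round(x, 2) for x in vec))
```

As in the numbered steps, a group-set is only ever added to the result; a larger α rejects more candidates, which trades result precision for fewer group-sets to examine.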

Hands-on 4: Mining multi-objective user groups
• Step 0. We continue the previous hands-on, so we need the
mined groups.

• Step 1. Download and unzip the file MOMRI.zip from the
following link: https://drive.google.com/drive/folders/1M-HlNao9tYwqN0imeZ-SzHnGZKMoJgh4?usp=sharing. It is
a Java NetBeans project whose main package is
“MOQO.MRI” and whose main executable is MOMRI.java.

• Step 2. Run the algorithm. The output of the algorithm
reports the progress in finding Pareto plans.

• Step 3. Add a new objective to the optimizer.

• Download the documentation at https://drive.google.com/file/d/1BE1jL2Lp327_Lxb1MMudY2p6l1tG_Uj4/view?usp=sharing.
Figure: excerpt of the documentation, with sections “Executable file”, “Parameters”, and “Output”. The visible text is cut off at the right edge: “Input data. The parameter “ds” (line 21 of MOMRI.java) specifies the name of the da… use. MovieLens 1M (ds=“ml1m”) is considered as the default dataset. You can also t… MovieLens 100K dataset (ds=“ml100k”). The method “read ratings()” in line 30 of MOMRI.java reads ratings from the data file on disk. The data file is hosted in the “da…”
Recommendation systems
• Recommendation systems are designed to automatically find relevant and desirable items to be consumed
by users in the future.

• In general, those systems work by predicting the items that are likely to be the most appealing to
users based on their preferences.

• Intuitively, the problem of recommendation reduces to filling missing values in the user-item interaction
matrix.
[Amer-Yahia and Benouaret, BigData’20]

Multi-scale rating user-item interaction matrix (“–” marks a missing rating)
      Terminal  Forrest.  Pianist  Psycho  Unhinged
u1    5         4         5        4       3
u2    4         5         5        –       –
u3    –         –         4        –       –
u4    –         3         3        –       –
u5    3         2         3        2       –
u6    3         4         2        –       –

Question. How would u2 rate the movie Psycho in the future?

Answer. Probably like other users similar to u2, such as u1 or u5.

Question. Is u2 more similar to u1 or u5?

Answer. Following their ratings for The Terminal, Forrest Gump,
and The Pianist, u2 is more similar to u1. Hence u2 would
probably rate Psycho around 4, as u1 did.

Types of recommendation systems
• Rule-based approaches used to be the dominant method for recommendation. They are still used in industry.

• The most common state-of-the-art approaches are content-based filtering and collaborative filtering.

• Content-based filtering recommends items based on ones that the user liked before.

• Collaborative filtering recommends items which are popular among the neighbors of the user.

CONTENT-BASED
Nina likes 60 drama movies, 20 romance, and 20 action.
La Vie en Rose (Biography, Drama): 60% sim.
Me before You (Drama, Romance): 80% sim.
Memento (Mystery, Thriller): 0% sim.
“Me before You” will be ranked higher than “La Vie en Rose” in Nina’s content-based recommendation.

COLLABORATIVE
Nina’s taste overlaps with Stephanie and Charles.
Stephanie has the same taste as Nina and likes “La Vie en Rose” more than “Me before You” (more impact).
Charles’s taste is somewhat different from Nina’s, and he likes “Memento” more than “Me before You” (less impact).
“La Vie en Rose” will be ranked higher than “Memento” in Nina’s collaborative recommendation.
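
The content-based example can be reproduced with a few lines of Python. The scoring rule used here is an assumption inferred from the numbers on the slide (a movie's similarity is the share of Nina's liked movies that fall into the movie's genres); real content-based systems typically use richer item features and other similarity measures.

```python
# Nina's profile, from the slide: number of liked movies per genre
liked_counts = {"Drama": 60, "Romance": 20, "Action": 20}
total = sum(liked_counts.values())
profile = {genre: n / total for genre, n in liked_counts.items()}

movies = {
    "La Vie en Rose": ["Biography", "Drama"],
    "Me before You": ["Drama", "Romance"],
    "Memento": ["Mystery", "Thriller"],
}

def similarity(genres):
    """Assumed rule: sum of the profile shares of the movie's genres."""
    return sum(profile.get(g, 0.0) for g in genres)

scores = {title: similarity(genres) for title, genres in movies.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for title in ranking:
    print(f"{title}: {scores[title]:.0%}")
```

Under this rule “Me before You” comes first, matching the ranking stated in the content-based part of the example.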

Collaborative filtering
• As collaborative filtering (CF) captures “like-minded behaviors”, it is often a favorite recommendation option.

• Two methods are proposed for implementing a CF approach: memory-based and model-based.

• In a memory-based implementation, the entire user-item interaction matrix is employed.

• In a model-based implementation, a model of users is developed to learn their preferences.
Figure: spectrum between model-based and memory-based implementations, with arrows labeled “towards more simplicity” and “towards more efficiency”.

Similarity between users
• An important step in recommendation is to compare all users to the input user and find the one that is most
similar.

• This is done using Pearson correlation.

• To measure the similarity between the tastes of Sara and Anderson, let’s assume x is the taste vector of Sara and
y is Anderson’s, both rating n movies.
$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$

• The value r lies in the range -1 to +1, where +1 means that Sara and Anderson have perfectly similar tastes,
and -1 means the opposite.

• In practice, this correlation cannot be computed for every single user, hence we often use a small sample.
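
A dependency-free sketch of the Pearson correlation between two taste vectors is given below. The ratings of Sara and Anderson are hypothetical, and returning 0.0 for a constant vector (where the correlation is undefined) is a convention chosen here, not something prescribed by the course.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long rating vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant rating vector as uncorrelated
    return cov / math.sqrt(var_x * var_y)

# hypothetical ratings of the same four movies
sara = [5, 4, 1, 2]
anderson = [4, 5, 2, 1]
opposite = [1, 2, 5, 4]

print(round(pearson(sara, anderson), 2))
print(round(pearson(sara, sara), 2))
print(round(pearson(sara, opposite), 2))
```

Only movies rated by both users should enter the two vectors, which is why the user-based code of the next slide first restricts the ratings to the movies of the input user.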

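Memory-based CF: user-based
• Common implementations are user-based and item-based. We practice the former.
import pandas as pd
import numpy as np

movies_df, ratings_df = read_data(…)

# movies liked by the input user (to be filled; columns "movie_id" and "rating")
user_preferences = pd.DataFrame()

# ratings given by other users to the movies of the input user
user_subset = ratings_df[ratings_df["movie_id"].isin(user_preferences["movie_id"].tolist())]

# group by user and keep the 100 users sharing the most movies with the input user
user_subset_group = user_subset.groupby("user_id")
user_subset_group = sorted(user_subset_group, key=lambda x: len(x[1]), reverse=True)
user_subset_group = user_subset_group[0:100]

# similarity between the input user and each of these users
pearson_correlation_dict = {}
for name, group in user_subset_group:
    pearson_correlation_dict[name] = pearson_correlation(user_preferences, group)

# keep the 50 most similar users and attach all their ratings
sim_df = pd.DataFrame(list(pearson_correlation_dict.items()), columns=["user_id", "sim"])
top_users = sim_df.sort_values("sim", ascending=False)[0:50]
top_users_rating = top_users.merge(ratings_df, on="user_id")

# weight each rating by the similarity of its author
top_users_rating["weighted_rating"] = top_users_rating["sim"] * top_users_rating["rating"]

# similarity-weighted average rating per movie, best movies first
recommendation_df = top_users_rating.groupby("movie_id")[["sim", "weighted_rating"]].sum()
recommendation_df["score"] = recommendation_df["weighted_rating"] / recommendation_df["sim"]
recommendation_df = recommendation_df.sort_values("score", ascending=False)

final_rec = movies_df.loc[movies_df["movie_id"].isin(recommendation_df.head(10).index.tolist())]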

Hands-on 5: Memory-based collaborative filtering
• Task. We are given a list of liked movies. Provide top-10 recommendations.

• Download the Python file cf.py from the following link and complete it: https://drive.google.com/drive/folders/1M-HlNao9tYwqN0imeZ-SzHnGZKMoJgh4?usp=sharing.
© streamingclarity.com

Model-based CF: matrix factorization
• Memory-based CF, as a “neighborhood” method focusing on maximizing “closeness”, does not handle scalability issues and
noise well.

• It operates on low-level (raw) data, which does not capture well the similarities between users at higher
levels.

• Matrix factorization is a solution for both aforementioned issues.

• Factorization is a simple but fundamental operation in mathematics, e.g., representing “12” with its factors
“4” and “3”.

• In the context of recommendation, it is the task of factorizing the user-item interaction matrix into two
matrices corresponding to users and items.

Singular Value Decomposition for Matrix Factorization
• Among different ways of factorizing matrices, Singular Value Decomposition (SVD) is of particular interest in the
recommendation domain.

• SVD is an algorithm that decomposes an interaction matrix $R$ and yields the “best” lower-rank approximation of $R$.

• The main SVD equation is as follows: $R = Q \Sigma P^{T}$, where $\Sigma$ is the diagonal matrix of singular values (weights).
© CodingFox

Model-based CF with SVD
• To get the lower-rank approximation, we employ SVD and maintain the top k latent features, which are the most
important underlying tastes.

• For illustration purposes, we consider k = 2, but k ≈ 50 is more common in practice.
import pandas as pd
import numpy as np
from scipy.sparse.linalg import svds

# step 1: load the data
ratings_df, users_df, movies_df = get_data(…)

# step 2: build the user-item matrix (missing ratings filled with 0) and factorize it
ratings_pivot_df = ratings_df.pivot(index="user_id", columns="movie_id", values="rating").fillna(0)
U, sigma, Vt = svds(ratings_pivot_df.to_numpy(dtype=float), k=2)
sigma = np.diag(sigma)

# step 3: reconstruct the matrix from the k latent features
predictions = np.dot(np.dot(U, sigma), Vt)

Step 1 (original dataset)
      Terminal  Forrest.  Pianist
u1    4.5       3         ??
u2    5         5         2

Step 2 (factorization)
Matrix U
      f1    f2
u1    1.1   2.3
u2    2.1   1

Matrix V
          f1    f2
Terminal  1.9   1
Forrest.  2.3   0
Pianist   0     2

Matrix Vt
      Terminal  Forrest.  Pianist
f1    1.9       2.3       0
f2    1         0         2

Step 3 (reconstructed dataset)
      Terminal  Forrest.  Pianist
u1    4.39      2.53      4.6
u2    4.99      4.83      2
Deep learning for recommendation
• So far, we covered neighborhood and matrix factorization methods for recommendation.

• For more efficiency and precision, we also look at deep approaches, i.e., the active trend in recommendation.

• Deep learning is data-hungry, hence we often use implicit-feedback data rather than explicit-feedback data.

Explicit-feedback interaction matrix (“–” marks a missing rating)
      Terminal  Forrest.  Pianist  Psycho  Unhinged
u1    5         4         5        4       3
u2    4         5         5        –       –
u3    –         –         4        –       –
u4    –         3         3        –       –
u5    3         2         3        2       –
u6    3         4         2        –       –

Implicit-feedback interaction matrix
      Terminal  Forrest.  Pianist  Psycho  Unhinged
u1    1         1         1        1       1
u2    1         1         1        0       0
u3    0         0         1        0       0
u4    0         1         1        0       0
u5    1         1         1        1       0
u6    1         1         1        0       0
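
Turning explicit feedback into implicit feedback amounts to replacing every observed rating with 1 and every missing one with 0. The sketch below does this for the six users of the running example; the dictionary layout is an illustrative choice, not the data format used in the hands-on.

```python
movies = ["Terminal", "Forrest.", "Pianist", "Psycho", "Unhinged"]

# explicit feedback: only the ratings that were actually given
explicit = {
    "u1": {"Terminal": 5, "Forrest.": 4, "Pianist": 5, "Psycho": 4, "Unhinged": 3},
    "u2": {"Terminal": 4, "Forrest.": 5, "Pianist": 5},
    "u3": {"Pianist": 4},
    "u4": {"Forrest.": 3, "Pianist": 3},
    "u5": {"Terminal": 3, "Forrest.": 2, "Pianist": 3, "Psycho": 2},
    "u6": {"Terminal": 3, "Forrest.": 4, "Pianist": 2},
}

# implicit feedback: 1 if the user interacted with the movie, 0 otherwise
implicit = {
    user: [1 if movie in ratings else 0 for movie in movies]
    for user, ratings in explicit.items()
}

for user, row in implicit.items():
    print(user, row)
```

The rating value itself is discarded, so a low rating and a high rating both count as an interaction.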

Neural Collaborative Filtering (NCF)
• We employ a simple but efficient implementation of a deep neural network for recommendation, called Neural
Collaborative Filtering (NCF).
import pandas as pd
import numpy as np
import torch.nn as nn
import pytorch_lightning as pl

ratings = read_data()

# make the algorithm scalable: keep 10% of the users
ratings = filter_to(ratings, 0.1)

train_ratings, test_ratings = split_train_test(ratings)

# mark all seen data as "1" and pick a few negative examples
users, items, labels = make_implicit_data(train_ratings)

model = NCF(num_users, num_items, train_ratings, movies)

# assumption: the trainer is the Trainer class of PyTorch Lightning
trainer = pl.Trainer(max_epochs=5)
trainer.fit(model)
[He et al. ArXiv’17]

# filter_to(ratings, 0.1) keeps a random 10% of the users
# step 1
random_users = np.random.choice(ratings['user_id'].unique(),
                                size=int(len(ratings['user_id'].unique()) * 0.1), replace=False)
# step 2
ratings = ratings.loc[ratings['user_id'].isin(random_users)]

# split_train_test(ratings) holds out the latest rating of each user for testing
# step 1
ratings['rank_latest'] = ratings.groupby(['user_id'])['timestamp'].rank(method='first', ascending=False)
# step 2
train_ratings = ratings[ratings['rank_latest'] != 1]
test_ratings = ratings[ratings['rank_latest'] == 1]

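NCF architecture
• Akin to the notion of latent factors in MF, the input to the network is user
and item embeddings.
import torch
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader

class NCF(pl.LightningModule):
    # the constructor is elided; it defines user_embedding, item_embedding, fc1, fc2 and output
    def __init__() …

    def forward(self, user_input, item_input):
        # look up the embedding vectors of the user and of the item
        user_embedded = self.user_embedding(user_input)
        item_embedded = self.item_embedding(item_input)
        # concatenate both embeddings and pass them through two dense layers
        vector = torch.cat([user_embedded, item_embedded], dim=-1)
        vector = nn.ReLU()(self.fc1(vector))
        vector = nn.ReLU()(self.fc2(vector))
        # probability that the user interacts with the item
        pred = nn.Sigmoid()(self.output(vector))
        return pred

    def training_step(self, batch, batch_idx):
        user_input, item_input, labels = batch
        predicted_labels = self(user_input, item_input)
        # binary cross-entropy compares the predictions with the true labels
        loss = nn.BCELoss()(predicted_labels, labels.view(-1, 1).float())
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

    def train_dataloader(self):
        return DataLoader(ratings, batch_size=512, num_workers=4)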
© James Loy @ Kaggle

Group recommendation
• The outcome of a typical recommendation engine is a personalized top-k recommendation list.

• What if a group of users wants to receive recommendations that they all appreciate collectively?

• A naïve approach towards group recommendation is the creation of a virtual user.
Figure: a group of three users (Julia, Olivia, Jacob) with their predicted ratings.
Predictions for Olivia: rating(“Me before You”) = 1, rating(“Memento”) = 3
Predictions for Julia: …
Predictions for Jacob: …

Question. Which movie should the group watch together?

Answer. Consider them as a virtual user with average rating.

Question. The average for both movies will become 2.33!! Alternative?

Answer. Consider them as a virtual user with least misery.

Question. The least misery score for both is 1!! Alternative?

Answer. …!
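
The two virtual-user strategies discussed above can be sketched as follows. Only Olivia's predictions come from the slide; the values for Julia and Jacob are hypothetical, chosen so that both strategies end in the tie described in the questions (average 2.33 and least misery 1 for both movies).

```python
predictions = {
    "Olivia": {"Me before You": 1, "Memento": 3},  # from the slide
    "Julia": {"Me before You": 3, "Memento": 1},   # hypothetical values
    "Jacob": {"Me before You": 3, "Memento": 3},   # hypothetical values
}
movies = ["Me before You", "Memento"]

def average(movie):
    """Virtual user with average rating: mean of the members' predictions."""
    return sum(p[movie] for p in predictions.values()) / len(predictions)

def least_misery(movie):
    """Virtual user with least misery: the lowest prediction among the members."""
    return min(p[movie] for p in predictions.values())

for movie in movies:
    print(movie, round(average(movie), 2), least_misery(movie))
```

Both aggregations fail to separate the two movies here, which motivates the more elaborate formulation of the group recommendation problem.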

Solving the group recommendation problem
• Problem. Given a user group $G$, return the $k$ best items to recommend (denoted as $I_G$) to $G$ during period $p$ such that

• $I_G$ contains $k$ items.

• Every item in $I_G$ is new to all members of $G$.

• There does not exist any other item whose score is higher than that of any item in $I_G$.

• Solution. A top-k processing algorithm is proposed.

• We materialize lists such as static affinity, absolute preference, and dynamic affinity, and then scan all lists in
round-robin fashion (like NRA), followed by a buffer update.

• We terminate using a stopping condition.
[Basu Roy et al., VLDBJ’10 and ICDE’14]

Top-k processing
• Top-$k$ processing is a series of algorithms with the aim of finding the $k$ items that best answer a user’s query.

• The performance of a top-k processing algorithm is measured in
terms of the number of sequential accesses (SAs) and random accesses
(RAs) it makes.

• For instance, you access your third favorite song on an audio tape
using SAs, and on Spotify (or essentially on a hard drive) using an RA.

• The naïve computation of top-k is to compute the score of each item,
sort the items in decreasing order, and return the top-k. When we have billions of items, this approach is infeasible.

• An alternative idea is to throw space at the problem, by pre-computing inverted lists and scanning them with a
stopping condition.

• Famous algorithms in this genre are TA and NRA. We review the latter here.

No-Random-Access (NRA) algorithm
• Access all lists sequentially and in parallel.

• After each cursor move compute

  • Worst-case score $W(r)$, best-case score $B(r)$ for each seen $r$ ($r$ is an item, e.g., a movie or a book)

  • Sort all seen items on $W(r)$, breaking ties by $B(r)$

  • if $W(r) > min_k$ then

    • add $r$ to buffer

    • $min_k = \min(W(r')\ \forall r' \in \text{buffer})$

  • else if $B(r) > min_k$

    • add $r$ to candidates

• Stop if $B(d') \leq min_k\ \forall d' \in \text{candidates}$

• Return the top-$k$ items
Figure: one sorted prediction list per user (Julia: Titanic 1, Terminal 0.2, …; Jacob: God Father 3.3, Titanic 1.4, …; Olivia: Titanic 2.3, God Father 0.1, …). Reading a list from top to bottom is a sequential access (SA); jumping to an arbitrary position is a random access (RA).
NRA example (step 1)
• We initialize cursors at the head of each list. We assume $k = 3$ (hence the buffer size) and we have
space to keep track of 10 candidates.

Jacob’s inverted list          Julia’s inverted list
movie  predicted rating        movie  predicted rating
r1     1.5                     r7     5
r2     1.5                     r3     4.5
r3     1.25                    r4     4.5
r4     1                       r2     4.25
r5     1                       r1     3.5
r6     0.8                     r9     3
r7     0.75                    r5     2
r8     0.75                    r6     1
r9     0.5                     r8     0.5
r10    0.5                     r10    0.5

Buffer (positions 1 to 3) and candidates (positions 4 to 10), with columns position, movie, worst-case score, best-case score: all empty.
0 SAs have been performed hitherto.
mink = ?

• We move the cursors sequentially.

• We complete the buffer by adding movies r7, r1, and r2 to it.

position  movie  worst-case score  best-case score
Buffer
1         r7     5                 6.5
2         r1     1.5               6.5
3         r2     1.5               6
Candidates: none
mink = 1.5

• Once the buffer is complete, we check whether a new movie is worth adding to the buffer.

position  movie  worst-case score  best-case score
Buffer
1         r7     5                 6.5
2         r1     1.5               6.5
3         r2     1.5               6
Candidates: none
mink = 1.5
Given worstcase(r3) = 4.5 and bestcase(r3) = 6, should it be added to the buffer?
Given that worstcase(r3) > mink, then YES!

• Some items will gradually transition from the buffer to the candidates (e.g., r2).

position  movie  worst-case score  best-case score
Buffer
1         r7     5                 6.5
2         r3     4.5               6
3         r1     1.5               6.5
Candidates
4         r2     1.5               6
mink gets updated but stays at 1.5.

• We have to check the stopping condition after each SA.

• We stop if max(bestcase(candidates)) < mink.

position  movie  worst-case score  best-case score
Buffer
1         r3     5.75              5.75
2         r7     5                 6.5
3         r1     1.5               6.5
Candidates
4         r2     1.5               6
mink = 1.5
⚠ Should we stop? 6 ≮ 1.5, then NO!
• For any new movie, we check if it should be added to the buffer. After the buffer update, we check the
stopping condition.

position  movie  worst-case score  best-case score
Buffer
1         r3     5.75              5.75
2         r7     5                 6.5
3         r1     1.5               6.5
Candidates
4         r2     1.5               6
mink = 1.5
Given worstcase(r4) = 4.5 and bestcase(r4) = 5.75, should it be added to the buffer?
Given that worstcase(r4) > mink, then YES!
• We update mink after any buffer update.

position  movie  worst-case score  best-case score
Buffer
1         r3     5.75              5.75
2         r7     5                 6.5
3         r4     4.5               5.75
Candidates
4         r1     1.5               6.5
5         r2     1.5               6
mink is updated from 1.5 to 4.5.
⚠ Should we stop? 6.5 ≮ 4.5, then NO!
7 SAs have been performed hitherto.

position  movie  worst-case score  best-case score
Buffer
1         r3     5.75              5.75
2         r4     5.5               5.5
3         r7     5                 6.5
Candidates
4         r1     1.5               6.5
5         r2     1.5               6
mink = 5
⚠ Should we stop? 6.5 ≮ 5, then NO!
position  movie  worst-case score  best-case score
Buffer
1         r3     5.75              5.75
2         r2     5.75              5.75
3         r4     5.5               5.5
Candidates
4         r7     5                 6.5
5         r1     1.5               6.5
mink = 5.5
• Some movies may not be worth to be added neither to the buffer nor to the candidates.
84
movie
predicted
rating
r1 1.5
r2 1.5
r3 1.25
r4 1
r5 1
r6 0.8
r7 0.75
r8 0.75
r9 0.5
r10 0.5
movie
predicted
rating
r7 5
r3 4.5
r4 4.5
r2 5.25
r1 3.5
r9 3
r5 2
r6 1
r8 0.5
r10 0.5
mink = ?
mink = 5.5
≮
Buffer
1 r3 5.75 5.75
2 r2 5.75 5.75
3 r4 5.5 5.5
Candidates
4 r7 5 6.5
5 r1 1.5 6.5
6
7
8
9
10

84
movie
predicted
rating
r1 1.5
r2 1.5
r3 1.25
r4 1
r5 1
r6 0.8
r7 0.75
r8 0.75
r9 0.5
r10 0.5
movie
predicted
rating
r7 5
r3 4.5
r4 4.5
r2 5.25
r1 3.5
r9 3
r5 2
r6 1
r8 0.5
r10 0.5
mink = ?
mink = 5.5
≮
Buffer
1 r3 5.75 5.75
2 r2 5.75 5.75
3 r4 5.5 5.5
Candidates
4 r7 5 6.5
5 r1 1.5 6.5
6
7
8
9
10
Given and
the buffer?
worsecase(r5) = 1
bestcase(r5) = 5.25

84
movie
predicted
rating
r1 1.5
r2 1.5
r3 1.25
r4 1
r5 1
r6 0.8
r7 0.75
r8 0.75
r9 0.5
r10 0.5
movie
predicted
rating
r7 5
r3 4.5
r4 4.5
r2 5.25
r1 3.5
r9 3
r5 2
r6 1
r8 0.5
r10 0.5
mink = ?
mink = 5.5
≮
Buffer
1 r3 5.75 5.75
2 r2 5.75 5.75
3 r4 5.5 5.5
Candidates
4 r7 5 6.5
5 r1 1.5 6.5
6
7
8
9
10
Given and
the buffer?
worsecase(r5) = 1
bestcase(r5) = 5.25
Given that , then
NO!
worsecase(r5) ≯ mink

84
movie
predicted
rating
r1 1.5
r2 1.5
r3 1.25
r4 1
r5 1
r6 0.8
r7 0.75
r8 0.75
r9 0.5
r10 0.5
movie
predicted
rating
r7 5
r3 4.5
r4 4.5
r2 5.25
r1 3.5
r9 3
r5 2
r6 1
r8 0.5
r10 0.5
mink = ?
mink = 5.5
≮
Buffer
1 r3 5.75 5.75
2 r2 5.75 5.75
3 r4 5.5 5.5
Candidates
4 r7 5 6.5
5 r1 1.5 6.5
6
7
8
9
10
Given and
the buffer?
worsecase(r5) = 1
bestcase(r5) = 5.25
Given that , then
NO!
Can we still keep it as a candidate?

Given worsecase(r5) = 1 and bestcase(r5) = 5.25, should r5 be added to the buffer?
Given that worsecase(r5) ≯ mink, then NO!
Can we still keep it as a candidate?
Given that bestcase(r5) ≯ mink, then NO!
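
The two checks above can be sketched as a small Python function. This is a minimal illustration, not the course's implementation: the name `place_movie` and the returned labels are assumptions, mink is read here as the smallest worst-case score currently in the buffer, and the strict comparison mirrors the ≯ used on the slide.

```python
def place_movie(worst_case, best_case, mink):
    """Decide where a newly seen movie goes (hypothetical helper).

    mink is assumed to be the smallest worst-case score in the buffer.
    """
    if worst_case > mink:
        # Already guaranteed to beat the weakest buffered movie.
        return "buffer"
    if best_case > mink:
        # Could still beat it once its remaining ratings are read.
        return "candidate"
    # Can never beat it, so the movie is not stored at all.
    return "discard"


# r5 on the slide: worsecase = 1, bestcase = 5.25, mink = 5.5
print(place_movie(1, 5.25, 5.5))
# r7 on the slide: worsecase = 5, bestcase = 6.5, mink = 5.5
print(place_movie(5, 6.5, 5.5))
```

With the slide's numbers, r5 fails both checks and is dropped, while r7 passes only the second check and stays a candidate.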

mink = 5.5

Buffer
rank  movie  worst case  best case
1     r3     5.75        5.75
2     r2     5.75        5.75
3     r4     5.5         5.5
Candidates
4     r7     5           6.5
5     r1     5           5
6-10  (empty)

Given worsecase(r6) = 0.8 and bestcase(r6) = 4.3, should r6 be added to the buffer?
Given that worsecase(r6) ≯ mink, then NO!
Can we still keep it as a candidate?
Given that bestcase(r6) ≯ mink, then NO!
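
The bounds for r6 can be reproduced with a short sketch. It assumes, as the numbers on the slide suggest, that a movie's score is the sum of its ratings in the two lists, that an unread rating counts as 0 in the worst case, and that in the best case it equals the last rating read from that list. The function name `bounds`, its arguments, and the list names are hypothetical.

```python
def bounds(seen, last_read):
    """Return (worst case, best case) for one movie (hypothetical helper).

    seen:      ratings already read for this movie, keyed by list name
    last_read: the most recent rating read from each list
    """
    # Worst case: lists where the movie is still unread contribute nothing.
    worst = sum(seen.values())
    # Best case: each unread list contributes the last rating read from it,
    # which is the highest value the movie can still have there.
    best = worst + sum(
        rating for name, rating in last_read.items() if name not in seen
    )
    return worst, best


# r6: rating 0.8 read from the first list; the second list was last read at 3.5 (r1)
worst, best = bounds({"list1": 0.8}, {"list1": 0.8, "list2": 3.5})
print(worst, round(best, 2))
```

Because both lists are sorted by decreasing rating, the last rating read from a list is an upper bound on anything not yet read from it, which is what makes the best case a safe bound.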

We add the movie r9 to neither the buffer nor the candidates.
