Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Exploratory Analysis of User Data
Behrooz Omidvar-Tehrani


Research Scientist at Grenoble AI Institute


http://www.omidvar.info
Intensive course in RAIS summer school, 17-19 May 2021

About the instructor
• Behrooz Omidvar-Tehrani, PhD in Computer Science and Applied Mathematics
• Research focus on interactive data analysis, at the crossroads of machine learning, data science, and data mining.
• Postdoctoral Researcher at The Ohio State University, 2016-2017
• Postdoctoral Researcher at The Grenoble Alpes University, 2017-2018
• Research Scientist at Naver Labs Europe, 2019-2020
• Research Scientist at Grenoble AI Institute, 2021-Present

Why user data?
• Because user data is ubiquitous.
• Users are very active on the Web, generating user data.
• Here is what happened in the last 5 minutes on the Web (per http://pennystocks.la/internet-in-real-time):
  • 3M new tweets posted on Twitter
  • 24M videos watched on YouTube
  • 274K photos uploaded to Instagram
  • 8M photos liked on Instagram
  • 22M searches performed on Google
  • 16M posts added on Facebook
  • 12M messages sent on WhatsApp
  • 51K video hours watched on Netflix
  • 1M users participated in a Zoom call

Hunger for user data
• The number of requests to obtain user data has increased drastically.
• Google received 48,941 government data requests affecting 83,345 user accounts in the first six months of 2017. The United States issued 16,823 of these requests.
• Dataset Search indexes almost 25 million user datasets. (https://blog.google/products/search/discovering-millions-datasets-web/)

Why analyze user data?
• In general, data analysis means to “collect data” and “provide insights”.
• User data analysis means extracting value from user data → behavioral analytics
• It unveils insights into the behavior of customers.
• Examples:
  • Netflix movie recommendation (© UX Collective)
  • Amazon product recommendation (© MagePlaza)
  • Analytical dashboards for business insights (© Marketing Land)
  • Automated medical analysis (© 123 RF)
[Omidvar-Tehrani and Amer-Yahia, TKDE’19]

User data analysis pipeline
• User data is voluminous and noisy, hence hard to get insights from.
• Often an analysis pipeline is designed to tackle the challenges of volume and noise.
• We often refer to it by its abbreviated form, the UDA pipeline.
• Why post-processing? Because mined results and recommendations need to be rendered in a human-understandable form.
• Why user data presentation? When digesting insights, the human brain performs better on visual elements than on textual information.
• Why user data exploration? An exhaustive scan through all discovered groups is not possible for users.
Pipeline: Raw user data → User Data Preparation → [towards less noise, towards less volume] → User Data Mining, Learning, and Recommendation → [post-processing] → User Data Presentation → User Data Exploration ↔ [interaction] ↔ User
[Omidvar-Tehrani, Amer-Yahia, Simon @ HILDA’19]

User roles in UDA pipelines
• Users with different roles and needs write UDA pipelines to achieve tasks.
  • Data scientist, who brings analysis expertise
  • Domain expert, who brings domain knowledge
  • Information consumer, who brings the task

Objectives and the timeline of the course
Objectives
• Motivate UDA and UDA pipelines and illustrate their importance in practice
• Understand the underlying structure of user data in its general form
• Walk through UDA pipelines and discuss their components, from preparation to exploration
• Work on hands-on experiences to observe the challenges of UDA implementation in practice
• Get familiar with the state of the art in UDA research
Timeline
• Session 1. Monday 17 May 2021, 10:30 - 12:30 (Introduction, User Data Preparation and Visualization)
• Session 2. Tuesday 18 May 2021, 10:30 - 12:30 (User Data Mining and Recommendation)
• Session 3. Wednesday 19 May 2021, 10:30 - 12:30 (User Data Exploration with Reinforcement Learning)

Topics covered in the course
• SESSION 1, User Data Preparation:
  • What is the general model behind all user datasets?
  • How to prepare user data for analysis?
  • How to increase the quality of user data?
• SESSION 2, User Data Mining, Learning, and Recommendation:
  • How to discover (mine) insights in user data?
  • How to build a recommender engine for user data?
  • How to recommend to a group of users?
• SESSION 1, User Data Presentation:
  • How to make sense out of user data?
  • How to discuss user data with collaborators?
• SESSION 3, User Data Exploration:
  • How to build interactive user data analysis systems?
  • How to learn interactions with user data?
  • How to guide users in labor-intensive tasks?

Course material
• This course is interactive. You participate in 10 polls throughout the course.
• Hands-on experiences. Some code templates will be delivered at the end of each session to practice the learned material.
• Course slides. Available at http://www.omidvar.info/#activities (“teaching” section)
• Questions. Please use during the sessions. For all other questions, email me at behrooz@omidvar.info.

About exercises
• Hands-on #1: Research paper finder. Practicing data crawling and data collection. Requirement: Python
• Hands-on #2: D3 histogram. Practicing user data visualization. Requirement: JavaScript and HTML
• Hands-on #3: Mining user groups. Practicing user data mining and itemset mining. Requirement: Python, basic C, basic cmd
• Hands-on #4: Multi-objective mining. Practicing multi-objective optimization. Requirement: Java
• Hands-on #5: Recommendation. Practicing recommendation algorithms. Requirement: Python
• Hands-on #6: Implementing exploration semantics. Practicing data / problem modeling. Requirement: Math and Logic
• Hands-on #7: Designing a Markov Decision Process. Practicing Markov Decision Processes. Requirement: Math and Logic
• Hands-on #8: RL for Exploratory User Data Analysis. Practicing reinforcement learning. Requirement: Python

Poll: Prioritizing actions in user data analysis
• Question. You are a data scientist in a company owning terabytes of user data. They ask you to deliver some good insights about their data, but they don’t have any specific questions to ask (or any hypotheses to form). They only give you one week to deliver results. How do you prioritize your actions?
• Popular answers
  • (A1) I start cleaning the data, build a visualization dashboard, and present some insights using the dashboard.
  • (A2) I prepare the data for exploration and ask the data owners to navigate in the data and evaluate some hypotheses.
  • (A3) I don't start the implementation; I first think on paper for a bit, in order to come up with a good pipeline plan.
  • (A4) I start performing some predictions on the raw data, followed by some post-processing steps.
  • (A5) I perform some mining on the raw data, followed by some post-processing steps.
• Votes: A1 35%, A2 30%, A3 25%, A4 5%, A5 5%

SESSION 1: User Data Preparation and Visualization

User data model
• User data is a (complex) bipartite graph between the set of users 𝒰 and the set of items ℐ.
• Attributes 𝒜 describe both users and items.
• Users 𝒰 are described by user demographics: gender, age, occupation, location, health status.
• Items ℐ include movies, medicine, groceries, music, books, and tweets.
• Each edge between a user and an item is an action; actions are temporal.
[Omidvar-Tehrani, Amer-Yahia @ TKDE’20]
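
A minimal sketch of this data model in Python, assuming a plain in-memory representation: users and items are the two node sets, each with an attribute dictionary, and every action is a timestamped edge between a user and an item. The class name `UserData` and its methods are hypothetical names chosen for illustration; they do not come from the course material or from any library.

```python
from dataclasses import dataclass, field


@dataclass
class UserData:
    """Bipartite user data: users and items are nodes, actions are edges."""
    user_attrs: dict = field(default_factory=dict)  # user -> {attribute: value}
    item_attrs: dict = field(default_factory=dict)  # item -> {attribute: value}
    actions: list = field(default_factory=list)     # (user, item, action, timestamp)

    def add_action(self, user, item, action, timestamp):
        """Record one temporal action (an edge of the bipartite graph)."""
        self.actions.append((user, item, action, timestamp))

    def items_of(self, user):
        """Items linked to a user by at least one action."""
        return {i for (u, i, _, _) in self.actions if u == user}

    def users_of(self, item):
        """Users linked to an item by at least one action."""
        return {u for (u, i, _, _) in self.actions if i == item}


data = UserData()
data.user_attrs["Amber"] = {"gender": "female", "occupation": "pianist"}
data.item_attrs["Titanic"] = {"type": "movie", "genres": ["Romance", "Drama"]}
data.add_action("Amber", "Titanic", "like", "2021-05-17")
```

Keeping actions as a flat edge list is one possible design choice; it makes the temporal dimension explicit, at the cost of a scan for each lookup.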

Links between users
• Users are not independent entities; they are connected through social links.
• Social links can be explicit (friendship on Facebook, following on Twitter, co-authorship) or implicit (like-minded users).
• Explicit link example. Mary is a female engineer. John is a male student. Mary and John are explicitly linked through their friendship on Facebook.
• Implicit link example. Elena is a female professor. Amber is a female pianist. Elena likes The Godfather (Crime, Drama). Amber likes Titanic (Romance, Drama). Elena and Amber are implicitly linked through their interest in drama-genre movies.
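
One way to derive implicit links, sketched under the assumption that two users are like-minded as soon as they share at least one liked genre. The function name `implicit_links` is hypothetical, and John's genre is invented for the example (the slide only gives genres for Elena and Amber).

```python
from itertools import combinations

# Genres of the movies each user likes. John's entry is a made-up value
# added only to show a pair without a shared genre.
liked_genres = {
    "Elena": {"Crime", "Drama"},
    "Amber": {"Romance", "Drama"},
    "John": {"Action"},
}


def implicit_links(liked_genres):
    """Return {(user_a, user_b): shared genres} for every like-minded pair."""
    links = {}
    for u, v in combinations(sorted(liked_genres), 2):
        shared = liked_genres[u] & liked_genres[v]
        if shared:  # at least one genre in common
            links[(u, v)] = shared
    return links


links = implicit_links(liked_genres)
```

A stricter notion of like-mindedness (for example a minimum overlap ratio) would only change the condition inside the loop.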

Simple data structure but rich value
• The simple bipartite structure of user data contains many pieces of useful information.
• Example. Amber is a female pianist. Amber likes Titanic.
  • Item attributes. Titanic was produced in 1997 by James Cameron, starring Leonardo DiCaprio and Kate Winslet.
  • Action attributes. Amber liked the movie Titanic on 17 May 2021, at 3365 Indiana Street, San Diego, USA.
  • User groups. Amber belongs to the group of female pianists in California with 34K members.
  • Abstract user groups. Amber also belongs to the group of females, the group of pianists, the group of Californians, and the group of Titanic lovers.
  • Abstract user attributes. Amber is also an artist.

User data preparation
• User data preparation is the process of preparing (raw) user data for UDA.
• The outcome of user data preparation is another version of user data with less noise.
• Steps: Extract, Transform, Load (ETL), covering User Data Ingestion, User Data Integration, and User Data Cleaning, followed by User Data Post-processing (Augmentation, Delivery).

Extract, Transform, Load (ETL)
• The first step in user data preparation is called ETL.
• Extract is the first phase of ETL: extracting user data from a source. The literature often places the “ingestion” and “integration” steps in this phase as well.
• Transform is an intermediate phase that applies a set of rules and pre-defined functions to prepare the data for loading. The literature often considers “data cleaning” a component of this phase.
• Load is the last phase, which places the data in the hosting structure, such as a relational or NoSQL database.
Where to obtain (public) user data?
• Collect user data using Amazon Mechanical Turk, SurveyMonkey, and other similar platforms.
• Crawl user data using BeautifulSoup and other similar libraries. This process is also called web scraping.
• Download the data from dataset repositories, e.g., UCI, Kaggle, GitHub, Google Dataset Search, Harvard Dataverse, etc.
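
The three ETL phases can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the course's implementation: the source is a small inline CSV string with made-up rows, the transform rules (trimming whitespace, lowercasing occupations, mapping empty ages to NULL) are examples, and the hosting structure is an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical raw source; a real pipeline would read a file or an API.
RAW = """user,age,occupation
u1, 34 ,Engineer
u2,,student
u3,27,PIANIST
"""


def extract(text):
    """Extract: read raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: apply simple rules to prepare records for loading."""
    clean = []
    for row in rows:
        age = row["age"].strip()
        clean.append((
            row["user"].strip(),
            int(age) if age else None,        # empty age becomes NULL
            row["occupation"].strip().lower(),
        ))
    return clean


def load(rows, conn):
    """Load: place the records in the hosting structure (a relational table)."""
    conn.execute("CREATE TABLE users (user TEXT PRIMARY KEY, age INTEGER, occupation TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
loaded = conn.execute("SELECT user, age, occupation FROM users ORDER BY user").fetchall()
```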

Data acquisition using crawling
• We crawl data if no direct and easy access to the data in question is available.
• Before crawling, we always have to check copyright issues. Also note that some websites offer their own APIs.
• Webpages with some regularities are the best candidates for crawling.
• Beautiful Soup is a Python library for pulling data out of HTML (https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

    from urllib.request import urlopen  # Python 3 replacement for urllib2

    from bs4 import BeautifulSoup

    url_template = "https://dblp.org/db/conf/sigmod/sigmod2020.html"
    keywords = ["user data"]

    # Download the page and parse its HTML.
    page = urlopen(url_template)
    soup = BeautifulSoup(page, "html.parser")

    # Select every <span class="title"> element as a paper title.
    papers = soup.find_all("span", {"class": "title"})

    # Print each title that contains at least one keyword (case-sensitive).
    for paper in papers:
        paper_str = paper.text
        for keyword in keywords:
            if paper_str.find(keyword) != -1:
                print(paper_str)
                break

Hands-on 1: Research paper finder
• Task. Write Python code that automatically finds all research papers (and their authors) about a given set of keywords 𝒲, where 𝒲 is an input parameter.
• Download the Python code paper-finder.py from the following link, and complete it: https://drive.google.com/drive/folders/1M-HlNao9tYwqN0imeZ-SzHnGZKMoJgh4?usp=sharing.
• Missing parts are marked with a TODO comment.
• The DM Authors dataset is built in the same way. Available on the PerSCiDO platform via https://doi.org/10.18709/perscido.2016.10.ds32
[Omidvar-Tehrani, Amer-Yahia, Termier @ CIKM’15]

Advanced data collection
• Nowadays most web pages are highly dynamic, and such dynamic content is harder to crawl.
• ScrapingBee is a library for headless web browsing. It emulates human behavior so that websites don’t block the crawling process.
• Selenium is an open-source project for browser automation. The following code crawls a webpage protected by a login.

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    # DRIVER_PATH, USERNAME, and PASSWORD are placeholders to be defined by the reader.
    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window

    # Selenium 4 style: the driver path is passed through a Service object.
    driver = webdriver.Chrome(service=Service(DRIVER_PATH), options=options)
    driver.get("https://news.ycombinator.com/login")
    print(driver.page_source)

    # The username field is selected by its type, since a bare //input may match a hidden field first.
    driver.find_element(By.XPATH, "//input[@type='text']").send_keys(USERNAME)
    driver.find_element(By.XPATH, "//input[@type='password']").send_keys(PASSWORD)
    driver.find_element(By.XPATH, "//input[@value='login']").click()

    driver.quit()

User data cleaning
• Data cleaning refers to the process of detecting and removing noise in data.
• The cleanliness of data can be evaluated using different measures such as validity, accuracy, completeness, consistency, and uniformity.
• User data cleaning techniques:
  • Dealing with missing values
  • Dealing with outliers
  • Data improvement
  • Data tidy-up
  • Scaling

User data cleaning techniques: missing values
• Missing values are considered noise.
• In user datasets, many attribute values are missing (e.g., gender, occupation, visitation date).
• When data is missing, we either drop or impute it.
• Dropping is often performed using a threshold.
• Imputation preserves the data size, and hence is often preferable to dropping.
• Numerical imputation. Replace missing data with a default value, for instance 0 instead of None. The median is another value to consider (why not the average?)
• Categorical imputation. Replace the missing values with the most frequent value in a column; otherwise use “other”.

    # `data` is assumed to be a pandas DataFrame.
    threshold = 0.7

    # Drop columns whose missing-value rate is at or above the threshold
    data = data[data.columns[data.isnull().mean() < threshold]]

    # Drop rows whose missing-value rate is at or above the threshold
    data = data.loc[data.isnull().mean(axis=1) < threshold]
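
The two imputation strategies described above can be sketched as follows, using plain Python lists with `None` for missing values instead of a DataFrame. The function names and sample values are hypothetical, and treating “other” as the fallback for a column with no observed value is one reading of the slide. The example also hints at the "why not average?" question: a single extreme age pulls the average far more than the median.

```python
from collections import Counter
from statistics import mean, median


def impute_numerical(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def impute_categorical(values, fallback="other"):
    """Replace None with the most frequent observed value, else the fallback."""
    observed = [v for v in values if v is not None]
    if not observed:
        return [fallback] * len(values)
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


ages = [25, None, 31, 29, 90]  # 90 is an extreme value
occupations = ["student", None, "student", "pianist", None]

imputed_ages = impute_numerical(ages)
imputed_occupations = impute_categorical(occupations)
average_age = mean(v for v in ages if v is not None)  # pulled up by the extreme value
```

Both functions keep the number of rows unchanged, which is the property that distinguishes imputation from dropping.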

User data cleaning techniques: outliers
• Outliers are considered potential noise.
• An outlier is a piece of data that doesn’t look normal.
• Methods for outlier detection are visualization (the most effective method), standard deviation, and percentiles.
  • Standard deviation. If a value's distance to the average is higher than X times the standard deviation, it can be assumed to be an outlier.
  • Percentiles. A certain percentage of the values from the top or the bottom can be considered outliers.
• Outlier values can be either dropped or capped.
  • As with missing-data techniques, the former doesn’t maintain the data size, while the latter does.
Is Brazil an outlier? What about Burundi?
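
A small sketch of the standard-deviation and percentile methods, and of dropping versus capping. The sample scores, the choice X = 2, and the 5th/95th percentile bounds are assumptions made for the example; the percentile lookup uses a simple index into the sorted values rather than interpolation.

```python
from statistics import mean, pstdev


def std_outliers(values, x):
    """Values farther than x standard deviations from the average."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) > x * sigma]


def cap_percentile(values, lower=0.05, upper=0.95):
    """Cap values outside the given percentiles; the data size is preserved."""
    ordered = sorted(values)
    lo = ordered[int(lower * (len(ordered) - 1))]
    hi = ordered[int(upper * (len(ordered) - 1))]
    return [min(max(v, lo), hi) for v in values]


scores = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]  # hypothetical scores

outliers = std_outliers(scores, x=2.0)
dropped = [v for v in scores if v not in outliers]  # dropping shrinks the data
capped = cap_percentile(scores)                     # capping keeps its size
```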

User data cleaning techniques: data improvement
• Data cleaning is not only about reducing noise, but also about increasing the utility of user data.
• Examples of data improvement techniques are binning and log transform.
[Figures: percentage binning; log transform]
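
Both techniques can be illustrated in a few lines. This sketch assumes that "percentage binning" means equal-frequency (percentile) binning, which is an interpretation rather than something the slide states; the visit counts are made up, and `log1p` is used so that zero values would also be handled.

```python
import math


def percentile_bins(values, n_bins=4):
    """Assign each value a bin index by its rank (equal-frequency binning)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def log_transform(values):
    """Compress a skewed, non-negative range using log(1 + v)."""
    return [math.log1p(v) for v in values]


visits = [1, 2, 3, 5, 8, 20, 150, 1000]  # hypothetical, heavily skewed counts

binned = percentile_bins(visits)
logged = log_transform(visits)
```

After the log transform the largest value is about ten times the smallest instead of a thousand times, which is the kind of utility gain the slide refers to.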

User data cleaning techniques: data tidy-up
• A user dataset is called tidy iff every row represents a user and every column represents a feature.
• Tidy datasets are easy to manipulate, model, and visualize.
• Grouping is the process of making an un-tidy dataset tidy. Common grouping operations are average, sum, and concatenation.
• Is ungrouping (tidy to un-tidy) necessary too?

Transaction user dataset (un-tidy)
user  score
u1    65
u2    14
u1    32
u3    60
u2    30
u1    90

Grouping →

Tidy user dataset
user  average score
u1    62.33
u2    22
u3    60
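
The grouping step in this example can be reproduced with a short sketch. It uses the six transactions from the slide and the average as the grouping operation; the function name `group_average` is a hypothetical choice.

```python
from collections import defaultdict

# Transaction (un-tidy) user dataset from the slide: one row per action.
transactions = [
    ("u1", 65), ("u2", 14), ("u1", 32),
    ("u3", 60), ("u2", 30), ("u1", 90),
]


def group_average(rows):
    """Group rows by user and average the scores: one row per user."""
    scores = defaultdict(list)
    for user, score in rows:
        scores[user].append(score)
    return {user: round(sum(v) / len(v), 2) for user, v in sorted(scores.items())}


tidy = group_average(transactions)
```

Swapping the averaging expression for `sum(v)` or a string join gives the other two grouping operations mentioned above.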

Data cleaning frameworks
• A framework by Michael Stonebraker (ACM Turing Award winner) focusing on data mastering and unification.
• Apple inductiv by Christopher Ré, Ihab Ilyas, and Theodoros Rekatsinas, focusing on employing artificial intelligence to automate the task of identifying and correcting errors in data.
• A framework by the same leaders as inductiv, focusing on providing a machine learning system for data repair and predictions on structured data.
• OpenCloud by NYU Data Science, focusing on providing a Python library for data preprocessing and cleaning.
• A framework by Laure Berti-Equille, focusing on providing a Python library for data preprocessing and cleaning based on Q-Learning.

Poll: Prioritizing data cleaning techniques
• Question. You are the head of a data engineering team in a healthcare company. Their user data is entered manually by nurses and hence is noisy, which means it includes many missing and possibly inaccurate values in patient information. How do you prioritize between the data cleaning techniques?
[Bar chart of votes per data cleaning technique: feature split, dropping, grouping, scaling, imputation, binning, log transform]

User data visualization
• Sensemaking of user data using visual variables.
• A visualization component consists of three building blocks: views, visual variables, and visual elements.
• Visualization can be done either at the beginning or at the end of UDA pipelines, for hypothesis testing and validation, respectively.
• At the core of visualizing user data is a mapping function that associates user characteristics with visual variables.
• The following is the visualization of the MovieLens dataset.
[Figure: visualization of the MovieLens dataset, annotated with its view, visual variables, and visual elements]
[Zegarra et al., FGCS’20]
[Heer and Hellerstein, VLDB’09]

Types of visualization
• User data can be visualized with typical visualization tools such as Tableau, or with more specialized approaches such as graph-based or location/time-based visualization.
• Off-the-shelf visualization. Visualization grammars [Satyanarayan et al., TVCG’17]: visual grammars facilitate creating, saving, and sharing visual analytics.
• Graph-based visualization. NodeTrix [Henry et al., TVCG’07]: user groups are shown using node-link diagrams and adjacency matrices.
• Geospatial and temporal visualization. Bike angels [Chung et al., COMPASS’18]: dispatchers are informed about adding (in blue) or removing (in red) bikes at the stations.
• Application-dependent visualization. Soccer analytics [Machado et al., CG’17]: players are users, and their actions are visualized to obtain insights.

Web-based visualization
• D3.js is a JavaScript library for web-based visualization. (Why web-based?)
• D3 stands for Data-Driven Documents.
• The starting point is often the visualization zoo at https://d3js.org.
• Co-developed by Jeffrey Heer (University of Washington) [Bostock et al., TVCG’11]

    <div id="scatter_area"></div>
    <script src="https://d3js.org/d3.v4.js"></script>
    <script>
      var margin = …
      var svg = d3.select("#scatter_area") …
      var data = [ {x:10, y:20}, {x:40, y:90}, {x:80, y:50} ]
      var x = d3.scaleLinear() …
      var y = d3.scaleLinear() …
      svg.selectAll("whatever").data(data).enter() …
    </script>

Detail of the x scale and axis:

    var x = d3.scaleLinear()
      .domain([0, 100])
      .range([0, width]);
    svg.append('g')
      .attr("transform", "translate(0," + height + ")")
      .call(d3.axisBottom(x));

Hands-on 2: D3 histogram
• Task. We are given a CSV file including the hours at which users logged in to a platform under investigation. Visualize a histogram for this data using D3.
• Download the content of the sub-folder D3-Histogram from the following link, and complete it: https://drive.google.com/drive/folders/1f82RplHgLte223QoD99UIKEM3IJSV4y5?usp=sharing.
• Missing parts are marked with a TODO comment.
• Important. You need a local web server to run this example. You can simply use:

    $ python -m SimpleHTTPServer 8000   # Python 2
    $ python3 -m http.server 8000       # Python 3

• The resulting figure shows that the peak hours were around 11AM and 5PM. It also shows that no log-in was done in the early morning.

Crossfilter
• Crossfilter is a JavaScript library focusing on fast multidimensional filtering for coordinated views.
• In other words, Crossfilter brings interactivity to visualizations.
• Source files are accessible via https://github.com/crossfilter/crossfilter. See examples at https://drarmstr.github.io/chartcollection/examples/#worldbank.
[Omidvar-Tehrani et al., ICDE’17]

Time-based visualization
• Various approaches have been proposed for the interactive visualization of users' time-based activities.
• EventFlow is an example of leveraging the time dimension, where groups of users are shown along their temporal actions in a visual interface. (https://hcil.umd.edu/eventflow/)
[Figure: EventFlow view showing groups of patients with common treatments and the length of treatments]
[Monroe et al., TVCG’13]

Takeaways from the first session
• Behavioral analysis is about extracting value from user data.
• User data is modeled as a bipartite graph with users on one side and items on the other, linked by actions.
• The user data analysis pipeline contains user data preparation, mining and recommendation, presentation, and exploration.
• We often obtain user data by collecting, crawling (scraping), or downloading from dataset repositories.
• The main tasks in user data cleaning deal with missing values, outliers, data improvement, data tidy-up, and data scaling.
• At the core of visualizing user data is a mapping function that associates user characteristics with visual variables.
• Visualization of user evolution needs special care.

SESSION 2: User Data Mining and Recommendation

Understanding user behavior
• One important task in UDA is to understand user behavior.
• Simply put, we’re interested in knowing “what users have done” by collecting their interactions with data.
• Understanding user behavior benefits businesses, as it helps them envision which services to expand in the future to increase both user satisfaction and revenue.
• Examples: Amazon product recommendation (© measuringu.com), Netflix movie recommendation (© Medium)

User data mining and recommendation
• We employ user data for two separate tasks: mining and recommendation.
• Mining
  • To understand and represent user behaviors in the captured data.
  • A famous application in industry is cross-selling: “customers who bought this item also bought …”.
  • The fundamental assumption is that there exist groups of user activities formed by like-minded users, which constitute different instances of user behavior. Hence the main action is grouping.
• Recommendation
  • To predict future user behaviors from the captured data. Recommendation is a great approach for personalization.
  • The fundamental assumption is that there exists a latent relation in user interactions, which can also predict possible future interactions. Hence the main action is relation discovery.
http://cliintel.com/diapers-beer-and-data-in-retail/
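
The cross-selling idea (“customers who bought this item also bought …”) can be sketched as a simple co-occurrence count over purchase baskets. This is a deliberately naive illustration, not an itemset mining algorithm; the baskets are invented and the function name `also_bought` is hypothetical.

```python
from collections import Counter

# Hypothetical purchase baskets, one set of items per customer visit.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "bread", "beer"},
]


def also_bought(baskets, item, top=1):
    """Items most often found in the same basket as the given item."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(basket - {item})  # count every co-purchased item
    return counts.most_common(top)


suggestion = also_bought(baskets, "diapers")
```

Grouping the baskets by the items they share is what turns raw actions into a statement about behavior, which matches the "main action is grouping" remark above.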

User data mining
• The main action in user data mining is grouping, which often takes place in an unsupervised context.
• We need two elements to group users: a distance function and a representation approach.
• The distance function imposes the grouping / mining semantics. It determines whether two users should or should not be placed in a common group. It is sometimes called a similarity function.
• The representation approach defines how each mined group should be labeled. In the following example, majority voting is used for representation.
• Example. Mia likes 60 drama movies and 40 action movies. What is her distance to the group of drama-genre lovers, and to the group of action-genre lovers?
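
A sketch of the two elements for the Mia example, under several assumptions that the slide does not fix: users are represented by their normalized genre profile, the distance is Euclidean and measured to the average profile of a group, and majority voting lets each member vote for their favorite genre. Group members and all function names are hypothetical.

```python
import math
from collections import Counter


def genre_profile(counts):
    """Normalize liked-movie counts into a genre distribution."""
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def centroid(members):
    """Average genre profile of a group of users."""
    profiles = [genre_profile(c) for c in members]
    genres = set().union(*profiles)
    return {g: sum(p.get(g, 0.0) for p in profiles) / len(profiles) for g in genres}


def distance(p, q):
    """Euclidean distance between two genre distributions."""
    genres = set(p) | set(q)
    return math.sqrt(sum((p.get(g, 0.0) - q.get(g, 0.0)) ** 2 for g in genres))


def majority_label(members):
    """Representation: each member votes for the genre they like most."""
    votes = Counter(max(c, key=c.get) for c in members)
    return votes.most_common(1)[0][0]


mia = {"drama": 60, "action": 40}
drama_group = [{"drama": 90, "action": 10}, {"drama": 80, "action": 20}]   # hypothetical members
action_group = [{"drama": 10, "action": 90}, {"drama": 30, "action": 70}]  # hypothetical members

d_drama = distance(genre_profile(mia), centroid(drama_group))
d_action = distance(genre_profile(mia), centroid(action_group))
```

With these assumptions Mia is closer to the drama group, but a different distance function could place her elsewhere, which is exactly why the distance function defines the mining semantics.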

Myriads of grouping methods
Community and Clique Detection
[Newman, Physical J.’04]
[Barbieri et al., ICDM’13]
[Goyal et al., CIKM’08]
Team and Tribe Formation
[Nikolaev et al., KDD’16]
[Figure 1: (a) Segments on IMDb (b) Segments’ Distributions (c) Segments Exploration with Rating Maps]
Segment Discovery
[Amer-Yahia et al., WWW’2017]
Pattern and Cube Mining
[Xin et al., KDD’06]
[Kamat et al., ICDE’14]
Clustering and Partitioning
[Agrawal et al., ACM’1998]
[Pedreira et al., VLDB’16]
Cohort Representation
[Jiang et al., VLDB’16]
[Omidvar-Tehrani, Amer-Yahia,
Lakshmanan @ DSAA’18]
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
• One common mining approach is clustering.


• K-means is the de facto standard in clustering. It is an iterative, expectation-maximization (EM) style approach that updates the parameters of each cluster until convergence.


• Cluster centroids are representatives.


• K-means is a hard clustering method.


• K-means clusters are radial.
Clustering: k-means
41
input parameter k
centroids ← k random users
repeat until convergence:
    for all users:
        find the centroid closest to the user
        assign the user to the cluster of that centroid (expectation)
    update each centroid from the users assigned to it (maximization)
return k centroids
[Figure © iChrome: k-means clusterings of the same data for k = 2, 3, 4, and 5]
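The pseudocode above can be sketched with NumPy as follows. This is a minimal illustration, not the course implementation: the function name `kmeans`, the Euclidean distance, the iteration cap, and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans(data, k, n_iter=100, seed=0):
    """Plain k-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # centroids <- k random users
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Expectation: assign every user to the closest centroid.
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Maximization: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # convergence
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated groups of users (hypothetical values).
data = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0],
                 [10.0, 10.0], [11.0, 11.0], [12.0, 12.0]])
centroids, labels = kmeans(data, k=2)
print(centroids)
print(labels)
```

Each user receives exactly one label, which is what makes k-means a hard clustering method.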
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
• Hard clustering bears no uncertainty.


• In real user data, users often belong to more than one group.


• A generalized non-hard (soft) clustering approach is Gaussian Mixture Models (GMM).

• The idea is to represent each cluster with a Gaussian distribution in lieu of a centroid. Hence the whole model contains k different distributions.

• The objective is to maximize the fit between the data points in each cluster and its representative distribution, using maximum likelihood estimation.
Clustering: Gaussian Mixture Models (GMM)
42
© Oscar Contreras Carrasco @ Towards Data Science
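To make the notion of soft membership concrete, here is a small one-dimensional GMM fitted with the EM algorithm in NumPy. It is a sketch under assumptions: the helper names, the quantile-based initialization, the fixed number of iterations, and the synthetic data are all illustrative choices, not part of the slides.

```python
import numpy as np

def responsibilities(x, weights, means, variances):
    """Soft membership of each point in each Gaussian component."""
    dens = (weights * np.exp(-0.5 * (x[:, None] - means) ** 2 / variances)
            / np.sqrt(2 * np.pi * variances))
    return dens / dens.sum(axis=1, keepdims=True)

def fit_gmm_1d(x, k=2, n_iter=50):
    """Fit a 1-D Gaussian mixture by maximum likelihood (EM)."""
    means = np.quantile(x, np.linspace(0.1, 0.9, k))  # assumed initialization
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        resp = responsibilities(x, weights, means, variances)       # E-step
        nk = resp.sum(axis=0)                                       # M-step
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
    return weights, means, variances

# Synthetic data: two overlapping groups of users on one feature.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(6.0, 1.0, 200)])
weights, means, variances = fit_gmm_1d(x, k=2)

# A point between the two groups belongs partly to both clusters.
soft = responsibilities(np.array([0.0, 3.0, 6.0]), weights, means, variances)
print(means)
print(soft.round(3))
```

Unlike k-means, each row of `soft` is a probability distribution over clusters, so a user in the middle is not forced into a single group.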
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
• Density-based spatial clustering of applications with noise (DB-Scan) is a grouping method based on both a distance and a minimum number of points. The combination of the two parameters creates a notion of neighborhood. The resulting clusters are not necessarily radial.
Clustering: DB-Scan
43
© KDnuggets
input parameter d and nbu // d = distance, nbu = minimum number of users


find the users in the d-neighborhood of every user, and identify core users with more than nbu neighbors.


find the connected components of core users on the neighbor graph, ignoring all non-core users.


assign each non-core user to a nearby cluster if the cluster is a d-neighbor, otherwise noise.


return clusters
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
• Clustering algorithms are also available off the shelf in high-level Python libraries.


• Among many successful libraries, scikit-learn is a popular and standard choice.


• For k-means, given the data and the number of clusters, the library does the rest.


• For DB-Scan, given the data, the distance and the minimum number of users, the library does the rest.
Python libraries for clustering
44
# k-means
from sklearn.cluster import KMeans
import numpy as np

data = np.array([[1, 2], [1, 4], …])
clusters = KMeans(n_clusters=2).fit(data)  # the parameter is called n_clusters
print(clusters.labels_)
# e.g., [1, 1, 1, 0, 0, …]
print(clusters.predict([[12, 3]]))  # predict expects a 2-D array of samples
# e.g., [0]

# DB-Scan
from sklearn.cluster import DBSCAN
import numpy as np

data = np.array([[1, 2], [2, 2], …])
clusters = DBSCAN(eps=3, min_samples=2).fit(data)
print(clusters.labels_)
# e.g., [0, 0, 0, 1, 1, -1, …] where -1 marks noise
# DBSCAN has no predict(); fit_predict clusters new_data from scratch
predictions = clusters.fit_predict(new_data)
# an array with one label per row of new_data
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
• From a human perspective, the groups produced by all previous approaches are hard to interpret.

• As explainability matters in AI (XAI trend), it is desirable to have a soft non-radial grouping method which represents groups in a human-understandable form, e.g., “group of students who participate in RAIS summer school.”

• Frequent Itemset Mining (FIM) is often considered as a method for market basket analysis.

• The initial goal is to find sets of products that are frequently bought together.

• Each frequent itemset is a describable group.
Frequent Itemset Mining
45
© S. Harris @ ScienceCartoonsPlus.com
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
FIM: Definitions
46
• We are given a set of items $\mathcal{I}$, where any subset of $\mathcal{I}$ is an itemset.

• We are also given a transaction (un-tidy) dataset $\mathcal{T}$ where each member of $\mathcal{T}$ is an itemset.

• Given an itemset $X \subseteq \mathcal{I}$, $support(X)$ is the number of transactions containing $X$.

• An itemset $X \subseteq \mathcal{I}$ is a frequent itemset if $support(X) \geq \delta$, where $\delta$ is the minimum support threshold.

• Given two itemsets $X \subseteq \mathcal{I}$ and $Y \subseteq \mathcal{I}$, an association rule $X \to Y$ with confidence $c$ holds if $c \geq \delta'$ ($\delta'$ is the minimum confidence threshold), where $c = support(X \cup Y) / support(X)$.
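The definitions of support and confidence translate almost directly into Python. The sketch below is illustrative: the function names and the six toy transactions are assumptions chosen for the example.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def confidence(x, y, transactions):
    """Confidence of the association rule X -> Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

# Toy transaction dataset: one set of watched movies per user (hypothetical).
transactions = [
    {"Terminal", "Forrest Gump", "Pianist", "Psycho", "Unhinged"},
    {"Terminal", "Forrest Gump", "Pianist", "Unhinged"},
    {"Pianist"},
    {"Forrest Gump", "Pianist"},
    {"Terminal", "Forrest Gump", "Pianist", "Psycho"},
    {"Terminal", "Forrest Gump", "Pianist"},
]

abs_support = support({"Terminal", "Forrest Gump", "Pianist"}, transactions)
rel_support = abs_support / len(transactions)
conf_1 = confidence({"Forrest Gump", "Pianist"}, {"Terminal"}, transactions)
conf_2 = confidence({"Pianist"}, {"Terminal", "Forrest Gump"}, transactions)
print(abs_support, round(rel_support, 2), round(conf_1, 2), round(conf_2, 2))
```

Note that the denominator of the confidence is the support of the left-hand side only, so a more specific left-hand side can never have a larger denominator than a more general one.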
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
FIM: Example
47
Transaction user dataset: the movies watched by each of the six users u1 to u6 [figure].
{The Terminal, Forrest Gump, The Pianist} is a frequent itemset.
absolute support = 4
relative support = 4/6 ≈ 67%
{Forrest Gump, The Pianist} → {The Terminal} is an association rule.
confidence = 4/5 = 80%
{The Pianist} → {The Terminal, Forrest Gump} is another association rule.
confidence = 4/6 ≈ 67%
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
FIM: Computation
48
• Apriori algorithm. It is a level-wise search (first 1-itemsets, then 2-itemsets, …) which exploits the following pruning opportunity: if an itemset is not frequent, then none of its supersets is frequent.

• For instance, if {Psycho, Unhinged} is not frequent, then {Psycho, Unhinged, The Pianist} cannot be frequent either.

• For instance, given a minimum support threshold above 2, the itemset {young, CA, student}, which covers 2 records, is not frequent, and neither is its superset {male, young, CA, student}.
[Agrawal et al., SIGMOD’93]
[Figure: lattice of itemsets with their record counts]
{}: 3662 records
{male}: 2634, {young}: 2147, {CA}: 664, {student}: 184
{male, young}: 1588, {male, CA}: 477, {male, student}: 120, {young, CA}: 375, {young, student}: 13, {CA, student}: 20
{male, young, CA}: 268, {male, young, student}: 13, {male, CA, student}: 17, {young, CA, student}: 2
{male, young, CA, student}: 2
[Omidvar-Tehrani, Amer-Yahia, Dutot, Trystram @ PKDD’16]
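The level-wise search and its pruning rule can be sketched as follows. This is a compact teaching version under assumptions (the function name, the frozenset representation, and the toy transactions are illustrative); production miners use far more efficient data structures.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frequent itemset: support} using level-wise search."""
    def sup(candidate):
        return sum(1 for t in transactions if candidate <= t)

    items = sorted({i for t in transactions for i in t})
    level = {frozenset([i]): sup(frozenset([i])) for i in items}
    level = {c: s for c, s in level.items() if s >= min_support}
    frequent, size = {}, 1
    while level:
        frequent.update(level)
        size += 1
        # Join step: build candidates of the next size from frequent itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every (size-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, size - 1))}
        level = {c: sup(c) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
    return frequent

# Toy transactions with integer item codes (hypothetical data).
transactions = [{1, 2, 3, 4, 5}, {1, 2, 3, 5}, {3}, {2, 3}, {1, 2, 3, 4}, {1, 2, 3}]
frequent = apriori(transactions, min_support=4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Items 4 and 5 are dropped at the first level, so no candidate containing them is ever counted; this is the pruning opportunity in action.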
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
FIM for mining describable groups of users
49
• We employ LCM, an efficient miner of frequent (closed) itemsets, for mining groups in user data.

• Step 1. Identifiers for both users and items should be mapped to a non-negative integer space (required by LCM). For instance, if the movie Titanic (as an item) is mapped to “25” and the user “John” is mapped to “120”, the tuple <120,25> means that John has watched the movie Titanic.

• Step 2. We transform a tidy dataset to an un-tidy (transactional) dataset, where each line represents one user and all item IDs associated with the user are listed on that line, separated by spaces.

• Step 3. Run LCM to mine groups.

• Each line in the output file returned by LCM represents one group.
[Takeaki et al., Discovery Science ’04]
http://research.nii.ac.jp/~uno/code/lcm.html
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
• With the approach discussed in the previous slides, we can obtain groups based solely on the co-occurrence of items.

• It is more desirable to mix demographics and items to obtain groups such as “middle-aged females in Grenoble who watched The Terminal and Forrest Gump.”

• It is possible to encode user attributes in the same transactional database. Then LCM will give us full-fledged groups.
Full-fledged behaviors in user data mining
50
User data (tidy):
user  gender  age     movies watched
u1    F       Young   Terminal, Forrest., Pianist, Psycho, Unhinged
u2    F       Middle  Terminal, Forrest., Pianist, Unhinged
u3    M       Middle  Pianist
u4    F       Young   Forrest., Pianist
u5    F       Middle  Terminal, Forrest., Pianist, Psycho
u6    M       Middle  Terminal, Forrest., Pianist

Movie codes:
movie     code
Terminal  1
Forrest.  2
Pianist   3
Psycho    4
Unhinged  5

Attribute value codes:
attribute value  code
Female           101
Male             102
Young            103
Middle           104

Un-tidy (transactional) dataset:
line #  transaction
1       1 2 3 4 5 101 103
2       1 2 3 5 101 104
3       3 102 104
4       2 3 101 103
5       1 2 3 4 101 104
6       1 2 3 102 104

LCM output: [1 2 3 101 104] (2) [2 5]
Translated: [Terminal Forrest. Pianist Female Middle] (2) [u2 u5]
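The encoding from the tidy table to transactions can be sketched in plain Python. The variable and function names are assumptions; the codes and the user table are the ones shown on this slide. The last lines only check by brute force which users contain the group reported above, they do not call LCM.

```python
movie_code = {"Terminal": 1, "Forrest.": 2, "Pianist": 3, "Psycho": 4, "Unhinged": 5}
attribute_code = {"F": 101, "M": 102, "Young": 103, "Middle": 104}

users = [
    ("u1", "F", "Young", ["Terminal", "Forrest.", "Pianist", "Psycho", "Unhinged"]),
    ("u2", "F", "Middle", ["Terminal", "Forrest.", "Pianist", "Unhinged"]),
    ("u3", "M", "Middle", ["Pianist"]),
    ("u4", "F", "Young", ["Forrest.", "Pianist"]),
    ("u5", "F", "Middle", ["Terminal", "Forrest.", "Pianist", "Psycho"]),
    ("u6", "M", "Middle", ["Terminal", "Forrest.", "Pianist"]),
]

def encode(user):
    """One transaction per user: movie codes followed by attribute codes."""
    _, gender, age, movies = user
    codes = sorted(movie_code[m] for m in movies)
    codes += [attribute_code[gender], attribute_code[age]]
    return codes

transactions = [encode(u) for u in users]
lines = [" ".join(str(c) for c in t) for t in transactions]
print("\n".join(lines))

# Which lines contain the group description [1 2 3 101 104]?
group = {1, 2, 3, 101, 104}
members = [i + 1 for i, t in enumerate(transactions) if group <= set(t)]
print(members)
```

Keeping attribute codes in a range disjoint from movie codes (here 101 and above) makes it easy to translate a mined itemset back into demographics and movies.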
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Hands-on 3: Mining user groups
51
• Step 1. Find the MovieLens 1M dataset on a dataset repository and download it. The dataset contains movies that users appreciated watching. We only need the file ratings.dat.

• Step 2. Download the Python file pmr.py from the following link and complete it: https://drive.google.com/drive/folders/1xMxGdcI2IGgTAhozDUqSfZAWzKVXfkjr?usp=sharing.

• Step 3. Run the code to obtain the output file pmr.txt.

• Step 4. Download the LCM software from the following link: https://drive.google.com/drive/folders/1xMxGdcI2IGgTAhozDUqSfZAWzKVXfkjr?usp=sharing.

• Step 5. Put the dataset file in the same folder as LCM.

• Step 6. Run LCM as follows:
./lcm CfI -l 5 -u 100 pmr.txt 3 out.txt

• Step 7. Open the output file out.txt. Each line in the file out.txt represents a group in the following structure: [set of items] (support) [set of users]. The description of the group is [set of items]. The set of group members is [set of users].

• Step 8. Try to find 5 interesting user groups.
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
• Question. Following the steps in the previous hands-on, what is the most challenging aspect of mining
groups which remains unsolved?
Poll: Challenge of mining user groups
52
[Figure: bar chart of votes per challenge of user data mining. Options: Efficiency, Overlap, Size of clusters, Explainability, Mechanism, Binning]
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
User data mining for advanced decision making
53
• Both clustering and frequent itemset mining are based on the idea of density maximization.


• But is density what the end user really wants to achieve?


• Oftentimes, more quality measures are required, such as coverage, diversity, and variance.
© prototypr.io
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Multi-objective optimization
54
• This makes a multi-objective optimization problem.


• Given a set of ratings $R$, identify all group-sets $G$ where each group-set satisfies:

• $coverage(G, R)$ is maximized, ensuring that most input records belong to at least one group in the output;

• $diversity(G, R)$ is maximized, ensuring that found groups are as different as possible from each other;

• $diameter(G, R)$ is minimized, ensuring that ratings within each group are homogeneous.

• The problem is proved to be NP-Complete by a reduction from the Exact 3-Set Cover problem (EC3).
[Omidvar-Tehrani, Amer-Yahia, Dutot, Trystram @ PKDD’16]
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Diameter objective
55
• Diameter is a simple but effective measure of variance in ratings.


• Below, we observe that most reviewers agree on a high score for the movie The Godfather → minimum diameter.


• We also observe that the reviewers are divided when voting on Fifty Shades of Grey → maximum diameter.
[Figure: rating distribution (count in % per rating score, 1 to 10) of The Godfather (1972) on IMDb: homogeneous rating distribution, minimum diameter]
[Figure: rating distribution (count in % per rating score, 1 to 10) of Fifty Shades of Grey (2015) on IMDb: polarized rating distribution, maximum diameter]
Other rating distributions exist, such as increasing, decreasing, heterogeneous, etc.
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Pareto group discovery
56
• A bottom-up exhaustive approach to discover the Pareto front.


• Generating fewer plans makes a Multi-Objective optimization algorithm run faster.
[Figure: user groups as Pareto fronts, plotted with coverage and diversity on axes from 0 to 1. Legend: candidate group-set, dominance area, rejected group-set, Pareto group-set, α-dominance area, group-set rejected in case of α-dominance]
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
An approximation algorithm for Pareto group discovery
57
1. Inputs are $k$, $\alpha > 1$, $R$
2. Output is the Pareto result set $\mathcal{P}$
3. $\mathcal{P} \leftarrow \emptyset$
4. For all user groups $g$ do
    1. $G$ ← Singleton group-set containing $g$
    2. If $G$ is not $\alpha$-dominated by any other group-set $\in \mathcal{P}$, then add $G$ to $\mathcal{P}$
5. For $n \in [2, k]$ do
    1. For each possible group-set $G$ of size $n$ do
        1. If $G$ is not $\alpha$-dominated by any other group-set $\in \mathcal{P}$, then add $G$ to $\mathcal{P}$
6. Return $\mathcal{P}$
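The filtering step of the algorithm can be sketched as below. The slides do not spell out the α-dominance test, so the definition used here is an assumption: a group-set α-dominates another if, after relaxing its objectives by the factor α, it is at least as good on coverage and diversity (maximized) and on diameter (minimized). The candidate names and objective values are made up, and the enumeration of group-sets of size up to k is left out.

```python
def alpha_dominates(a, b, alpha=1.0):
    """Assumed test: does a alpha-dominate b? Tuples are (coverage, diversity, diameter)."""
    cov_a, div_a, dia_a = a
    cov_b, div_b, dia_b = b
    return (alpha * cov_a >= cov_b          # coverage is maximized
            and alpha * div_a >= div_b      # diversity is maximized
            and dia_a <= alpha * dia_b)     # diameter is minimized

def pareto_filter(candidates, alpha=1.0):
    """Keep a candidate only if no already kept group-set alpha-dominates it."""
    front = []
    for name, objectives in candidates:
        if not any(alpha_dominates(kept, objectives, alpha) for _, kept in front):
            front.append((name, objectives))
    return front

# Hypothetical group-sets with precomputed objective values.
candidates = [
    ("A", (0.90, 0.80, 0.20)),
    ("B", (0.85, 0.85, 0.20)),
    ("C", (0.50, 0.99, 0.10)),
]
exact = [name for name, _ in pareto_filter(candidates, alpha=1.0)]
approx = [name for name, _ in pareto_filter(candidates, alpha=1.1)]
print(exact)
print(approx)
```

With α = 1.1 the near-duplicate B is rejected, which illustrates how a larger α keeps fewer plans at the price of an approximate front.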
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Hands-on 4: Mining multi-objective user groups
58
• Step 0. We continue the previous hands-on, so we need the mined groups.

• Step 1. Download and unzip the file MOMRI.zip from the following link: https://drive.google.com/drive/folders/1M-HlNao9tYwqN0imeZ-SzHnGZKMoJgh4?usp=sharing. It is a Java NetBeans project whose main package is “MOQO.MRI” and whose main executable is MOMRI.java.

• Step 2. Run the algorithm. The output of the algorithm reports the progress in finding Pareto plans.

• Step 3. Add a new objective to the optimizer.

• Download the documentation at https://drive.google.com/file/d/1BE1jL2Lp327_Lxb1MMudY2p6l1tG_Uj4/view?usp=sharing.
[Figure: excerpt of the documentation, annotated with the executable file, the parameters, and the output. The visible part states that the parameter “ds” (line 21 of MOMRI.java) specifies the dataset, that MovieLens 1M (ds=“ml1m”) is the default, that MovieLens 100K (ds=“ml100k”) can also be used, and that the method “read ratings()” in line 30 of MOMRI.java reads ratings from the data file on disk.]
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Recommendation systems
59
• Recommendation systems are designed to automatically find relevant and desirable items to be consumed by users in the future.

• In general, those systems work by means of predicting items that are likely to be the most appealing to users based on their preferences.

• Intuitively, the problem of recommendation reduces to filling missing values in the user-item interaction matrix.
[Amer-Yahia and Benouaret, BigData’20]
      Terminal  Forrest.  Pianist  Psycho  Unhinged
u1    5         4         5        4       3
u2    4         5         5        -       -
u3    -         -         4        -       -
u4    -         3         3        -       -
u5    3         2         3        2       -
u6    3         4         2        -       -
Question. How would u2 rate the movie Psycho in the future?


Answer. Probably like other users similar to u2, such as u1 or u5.


Question. Is u2 more similar to u1 or u5?


Answer. Following their ratings for The Terminal, Forrest Gump,
and The Pianist, u2 is more similar to u1. Hence u2 would
probably rate Psycho around 4, like what u1 did.
Multi-scale rating user-item interaction matrix
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Types of recommendation systems
60
• Rule-based approaches used to be the dominant method for recommendation. They are still used in industry.

• The most common state-of-the-art approaches are content-based filtering and collaborative filtering.

• Content-based filtering recommends items based on ones that the user liked before.

• Collaborative filtering recommends items which are popular among the neighbors of the user.
CONTENT-BASED
Nina likes 60 drama movies, 20 romance, and 20 action.
La Vie en Rose (Biography, Drama): 60% sim.
Me before You (Drama, Romance): 80% sim.
Memento (Mystery, Thriller): 0% sim.
“Me before You” will be ranked higher than “La Vie en Rose” in Nina’s content-based recommendation.
COLLABORATIVE
Nina’s taste overlaps with Stephanie and Charles.
Stephanie (more impact) has the same taste as Nina and likes “La Vie en Rose” more than “Me before You”.
Charles (less impact) has a taste that is somewhat different from Nina’s, and he likes “Memento” more than “Me before You”.
“La Vie en Rose” will be ranked higher than “Memento” in Nina’s collaborative recommendation.
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Collaborative filtering
61
• As collaborative filtering (CF) captures “like-minded behaviors”, it is often a favorite recommendation option.


• Two methods are proposed for implementing a CF approach: memory-based and model-based.


• In a memory-based implementation, the entire user-item interaction matrix is employed.


• In a model-based implementation, a model of users is developed to learn their preferences.
[Figure: a spectrum between model-based (towards more efficiency) and memory-based (towards more simplicity)]
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Similarity between users
62
• An important step in recommendation is to compare all users to the input user and find the one that is most similar.

• This is done using Pearson correlation.

• To measure the similarity between the tastes of Sara and Anderson, let’s assume x is the taste vector of Sara and y is Anderson’s, both rating n movies.

• The value r could be in the range -1 to +1, where +1 means that Sara and Anderson have perfectly similar tastes, and -1 means the opposite.

• In practice, this correlation cannot be computed against every single user, hence we often use a small sample of users.
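For reference, the standard Pearson correlation between two rating vectors can be computed as in the sketch below. The function name and the rating vectors for Sara, Anderson, and a third user are hypothetical values chosen so that the two extreme cases are visible.

```python
import math

def pearson(x, y):
    """Pearson correlation r between two equally long rating vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no taste signal
    return cov / math.sqrt(var_x * var_y)

sara = [5, 3, 4, 4]       # hypothetical ratings on n = 4 movies
anderson = [4, 2, 3, 3]   # same ordering of preferences, one point lower
opposite = [1, 3, 2, 2]   # reversed preferences

r_similar = pearson(sara, anderson)
r_opposite = pearson(sara, opposite)
print(round(r_similar, 2), round(r_opposite, 2))
```

Because the means are subtracted, a user who systematically rates one point lower still gets r = +1, which is usually the desired behavior when comparing tastes.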
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Memory-based CF: user-based
• Common implementations are user-based and item-based. We practice the former.
63
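import pandas as pd
import numpy as np

# read_data and pearson_correlation are helpers that are not shown on the slide
movies_df, ratings_df = read_data(…)
# movies liked by the input user (columns movie_id and rating), to be filled in
user_preferences = pd.DataFrame()
# ratings given by other users to the movies of the input user
user_subset = ratings_df[ratings_df["movie_id"].isin(user_preferences["movie_id"].tolist())]
user_subset_group = user_subset.groupby("user_id")
# users sharing the most movies with the input user come first; keep a sample of 100
user_subset_group = sorted(user_subset_group, key=lambda x: len(x[1]), reverse=True)
user_subset_group = user_subset_group[0:100]
pearson_correlation_dict = {}
for name, group in user_subset_group:
    pearson_correlation_dict[name] = pearson_correlation(user_preferences, group)
# keep the 50 most similar users and join them with all of their ratings
sim_df = pd.DataFrame(list(pearson_correlation_dict.items()), columns=["user_id", "sim"])
top_users = sim_df.sort_values("sim", ascending=False)[0:50]
top_users_rating = top_users.merge(ratings_df, on="user_id")
top_users_rating["weighted_rating"] = top_users_rating["sim"] * top_users_rating["rating"]
# similarity-weighted average rating per movie
recommendation_df = top_users_rating.groupby("movie_id")[["sim", "weighted_rating"]].sum()
recommendation_df["score"] = recommendation_df["weighted_rating"] / recommendation_df["sim"]
recommendation_df = recommendation_df.sort_values("score", ascending=False)
final_rec = movies_df.loc[movies_df["movie_id"].isin(recommendation_df.head(10).index.tolist())]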
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
• Task. We are given a list of liked movies. Provide top-10 recommendations.


• Download the Python file cf.py from the following link and complete it: https://drive.google.com/drive/folders/1M-HlNao9tYwqN0imeZ-SzHnGZKMoJgh4?usp=sharing.

• Missing parts are marked with a TODO comment.
Hands-on 5: Memory-based collaborative filtering
64
© streamingclarity.com
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Model-based CF: matrix factorization
65
• CF as a “neighborhood” method, focusing on maximizing “closeness”, does not handle scalability issues and
noise.


• CF performs on low-level (raw) data which does not capture well the similarities between users on higher
levels.


• Matrix Factorization is a solution for both aforementioned issues.


• Factorization is a simple but fundamental operation in mathematics, e.g., representing “12” with its factors, which are “4” and “3”.


• In the context of recommendation, it is the task of factorizing the user-item interaction matrix into two
matrices corresponding to users and items.
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Singular Value Decomposition for Matrix Factorization
66
• Among different ways of factorizing matrices, Singular Value Decomposition (SVD) is of particular interest in the
recommendation domain.


• SVD is an algorithm that decomposes an interaction matrix $R$ and yields the “best” lower-rank approximation of $R$.

• The main SVD equation is as follows: $R = Q \Sigma P^T$, where $\Sigma$ is the diagonal matrix of singular values (weights).
© CodingFox
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Model-based CF with SVD
• To get the lower-rank approximation, we employ SVD and maintain the top k latent features, which capture the most important underlying tastes.

• For illustration purposes, we consider k = 2, but k ≈ 50 is more common in practice.
67
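import pandas as pd
import numpy as np
from scipy.sparse.linalg import svds

# step 1: get_data is a helper that is not shown on the slide
ratings_df, users_df, movies_df = get_data(…)

# step 2: users as rows, movies as columns, 0 for missing ratings
# (the column names user_id, movie_id, and rating are assumed)
ratings_pivot_df = ratings_df.pivot(index="user_id", columns="movie_id", values="rating").fillna(0)
U, sigma, Vt = svds(ratings_pivot_df.to_numpy(dtype=float), k=2)
sigma = np.diag(sigma)

# step 3: multiply the three factors back together
predictions = np.dot(np.dot(U, sigma), Vt)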


Step 1 (original dataset):
      Terminal  Forrest.  Pianist
u1    4.5       3         ??
u2    5         5         2

Step 2, matrix U:
      f1    f2
u1    1.1   2.3
u2    2.1   1

Step 2, matrix V:
           f1    f2
Terminal   1.9   1
Forrest.   2.3   0
Pianist    0     2

Step 2, matrix Vt:
      Terminal  Forrest.  Pianist
f1    1.9       2.3       0
f2    1         0         2

Step 3 (reconstructed dataset):
      Terminal  Forrest.  Pianist
u1    4.39      2.53      4.6
u2    4.99      4.83      2
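The reconstruction in step 3 can be checked numerically. The sketch below multiplies the user and item factor matrices of the toy example; it assumes that the singular values are already folded into the factors, as the numbers of the example suggest, and it does not run an actual SVD.

```python
import numpy as np

# Factor matrices from the toy example (users x features, features x movies).
U = np.array([[1.1, 2.3],
              [2.1, 1.0]])
Vt = np.array([[1.9, 2.3, 0.0],
               [1.0, 0.0, 2.0]])

# Every cell of the reconstruction is a dot product of a user row and a movie column.
predictions = U @ Vt
print(predictions.round(2))

# The rating missing in the original dataset (u1, Pianist) is read from the same cell.
predicted_u1_pianist = predictions[0, 2]
print(round(predicted_u1_pianist, 2))
```

The known ratings are only approximated (4.39 instead of 4.5, for instance), which is expected from a rank-2 model; the gain is that the missing cell receives a value.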
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Deep learning for recommendation
68
• So far, we covered neighborhood and matrix factorization methods for recommendation.


• For more efficiency and precision, we also look at deep approaches, an active trend in recommendation.

• Deep learning is data-hungry, hence we often use implicit-feedback data rather than explicit-feedback data.
      Terminal  Forrest.  Pianist  Psycho  Unhinged
u1    5         4         5        4       3
u2    4         5         5        -       -
u3    -         -         4        -       -
u4    -         3         3        -       -
u5    3         2         3        2       -
u6    3         4         2        -       -
Explicit-feedback interaction matrix
Terminal Forrest. Pianist Psycho Unhinged
u1 1 1 1 1 1
u2 1 1 1 0 0
u3 0 0 1 0 0
u4 0 1 1 0 0
u5 1 1 1 1 0
u6 1 1 1 0 0
Implicit-feedback interaction matrix
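Turning explicit feedback into implicit feedback amounts to recording whether an interaction exists. The NumPy sketch below uses the two matrices of this slide; representing a missing rating as NaN is an assumption made for the example.

```python
import numpy as np

nan = np.nan  # a missing rating
# Explicit-feedback matrix: rows u1..u6, columns Terminal, Forrest., Pianist, Psycho, Unhinged.
explicit = np.array([
    [5,   4,   5,   4,   3],
    [4,   5,   5,   nan, nan],
    [nan, nan, 4,   nan, nan],
    [nan, 3,   3,   nan, nan],
    [3,   2,   3,   2,   nan],
    [3,   4,   2,   nan, nan],
])

# Implicit feedback: 1 if the user interacted with the movie, 0 otherwise.
implicit = (~np.isnan(explicit)).astype(int)
print(implicit)
```

The rating values are discarded on purpose: a low rating still counts as an interaction, which is why implicit data is more abundant but also noisier.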
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Neural Collaborative Filtering (NCF)
69
• We employ a simple but efficient implementation of a deep neural network for recommendation, called Neural Collaborative Filtering (NCF).
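import pandas as pd
import numpy as np
import torch.nn as nn
import pytorch_lightning as pl  # assumed: the trainer below follows the PyTorch Lightning API

# read_data, filter_to, split_train_test, and make_implicit_data are helpers
ratings = read_data()
# make the algorithm scalable
ratings = filter_to(ratings, 0.1)
train_ratings, test_ratings = split_train_test(ratings)
# mark all seen data as “1” and …
# … pick a few negative examples
users, items, labels = make_implicit_data(train_ratings)
# num_users, num_items, and movies (all movie IDs) are derived from ratings beforehand
model = NCF(num_users, num_items, train_ratings, movies)
trainer = pl.Trainer(max_epochs=5)
trainer.fit(model)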
[He et al. ArXiv’17]
# filter_to, step 1: sample 10% of the users at random
random_users = np.random.choice(ratings['user_id'].unique(), size=int(len(ratings['user_id'].unique()) * 0.1), replace=False)
# filter_to, step 2: keep only the ratings of the sampled users
ratings = ratings.loc[ratings['user_id'].isin(random_users)]
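# split_train_test, step 1: rank the ratings of each user from the latest to the oldest
ratings['rank_latest'] = ratings.groupby(['user_id'])['timestamp'].rank(method='first', ascending=False)
# split_train_test, step 2: hold out the latest rating of each user for testing
train_ratings = ratings[ratings['rank_latest'] != 1]
test_ratings = ratings[ratings['rank_latest'] == 1]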
[He et al. ArXiv’17]
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
NCF architecture
70
• Akin to the notion of latent factors in MF, the input to the network is user
and item embeddings.
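import torch
import torch.nn as nn
import pytorch_lightning as pl  # assumed: the hook names below follow PyTorch Lightning
from torch.utils.data import DataLoader

class NCF(pl.LightningModule):

    def __init__() …  # elided on the slide: defines the embeddings and layers used below

    def forward(self, user_input, item_input):
        # look up the latent vectors of the user and the item
        user_embedded = self.user_embedding(user_input)
        item_embedded = self.item_embedding(item_input)
        # concatenate both embeddings and pass them through two dense layers
        vector = torch.cat([user_embedded, item_embedded], dim=-1)
        vector = nn.ReLU()(self.fc1(vector))
        vector = nn.ReLU()(self.fc2(vector))
        # probability that the user interacts with the item
        pred = nn.Sigmoid()(self.output(vector))
        return pred

    def training_step(self, batch):
        user_input, item_input, labels = batch
        predicted_labels = self(user_input, item_input)
        # binary cross-entropy between predictions and the 0/1 labels
        loss = nn.BCELoss()(predicted_labels, labels.view(-1, 1).float())
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

    def train_dataloader(self):
        return DataLoader(ratings, batch_size=512, num_workers=4)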
© James Loy @ Kaggle
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Group recommendation
71
• The outcome of a typical recommendation engine is a personalized top-k recommendation list.


• What if a group of users want to receive recommendations that they all appreciate collectively?


• A naïve approach towards group recommendation is the creation of a virtual user.
Predicted ratings for the three group members:
member   “Me before You”   “Memento”
Olivia   1                 3
Julia    1                 1
Jacob    5                 3
Question. Which movie should the group watch together?


Answer. Consider them as a virtual user with average rating.


Question. The average for both movies will become 2.33!! Alternative?


Answer. Consider them as a virtual user with least misery.


Question. The least misery score for both is 1!! Alternative?


Answer. …!
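The two virtual-user strategies discussed above can be reproduced in a few lines. The function name `group_scores` is an assumption; the predicted ratings are those of Olivia, Julia, and Jacob in the example.

```python
predictions = {
    "Olivia": {"Me before You": 1, "Memento": 3},
    "Julia": {"Me before You": 1, "Memento": 1},
    "Jacob": {"Me before You": 5, "Memento": 3},
}

def group_scores(predictions, aggregate):
    """Score every movie for the group by aggregating the members' predictions."""
    movies = next(iter(predictions.values())).keys()
    return {m: aggregate([member[m] for member in predictions.values()]) for m in movies}

# Virtual user with average rating.
average = group_scores(predictions, lambda scores: sum(scores) / len(scores))
# Virtual user with least misery: the group is as happy as its unhappiest member.
least_misery = group_scores(predictions, min)

print({m: round(s, 2) for m, s in average.items()})
print(least_misery)
```

Both strategies produce a tie on this example, which is the point of the slide: a single aggregated virtual user can hide very different preference profiles inside the group.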
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Solving the group recommendation problem
72
• Problem. Given a user group $G$, return the $k$ best items to recommend (denoted as $I_G$) to $G$ during period $p$ such that

• $I_G$ contains $k$ items.

• Every item in $I_G$ is new to all members of $G$.

• There does not exist any other item whose score is higher than that of any item in $I_G$.

• Solution. A top-k processing algorithm is proposed.

• We materialize lists such as static affinity, absolute preference, and dynamic affinity, and then scan all lists in round-robin fashion (like NRA), followed by a buffer update.

• We terminate using a stopping condition.
[Basu Roy et al., VLDBJ’10 and ICDE’14]
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
Top-k processing
73
• Top-$k$ processing is a family of algorithms with the aim of finding the $k$ items that best answer a user’s query.

• The performance of a top-k processing algorithm is measured in terms of the number of sequential accesses (SAs) and random accesses (RAs) it makes.

• For instance, you access your third favorite song on an audio tape using SAs, and on Spotify (or essentially on a hard drive) using an RA.

• The naïve computation of top-k is to compute the score of each item, sort them in decreasing order, and return the top-k. When we have billions of items, this approach is infeasible.

• An alternative idea is to throw space at the problem, by pre-computing inverted lists and scanning them, with a stopping condition.

• Famous algorithms in this genre are TA and NRA. We review the latter here.
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
No-Random-Access (NRA) algorithm
74
• Access all lists sequentially and in parallel.

• After each cursor move compute
    • Worst-case score $W(r)$, best-case score $B(r)$ for each seen $r$ ($r$ is an item, e.g., a movie or a book)
    • Sort all seen items on $W(r)$, breaking ties by $B(r)$
    • if $W(r) > min_k$ then
        • add $r$ to buffer
        • $min_k = \min(W(r') \; \forall r' \in \text{buffer})$
    • else if $B(r) > min_k$
        • add $r$ to candidates

• Stop if $B(d') \leq min_k \; \forall d' \in$ candidates

• Return the top-$k$ items
[Figure: one sorted prediction list per user (Julia: Titanic 1, Terminal 0.2, …; Jacob: God Father 3.3, Titanic 1.4, …; Olivia: Titanic 2.3, God Father 0.1, …), annotated with sequential access (SA) and random access (RA)]
Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event
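The steps above can be sketched as follows. This is a minimal illustration rather than a reference implementation, under several assumptions made here: the score of an item is the sum of its scores over all lists, an unseen score counts as 0 in the worst case and as the score under the cursor in the best case, lists are read in round-robin order, $k \ge 1$, and all bounds are recomputed after every access. The sketch also checks the best case of items not seen in any list yet, as the original NRA formulation does. The two lists are the inverted lists of Jacob and Julia used in the worked example of this deck.

```python
from typing import Dict, List, Tuple

InvertedList = List[Tuple[str, float]]  # (item, score) pairs, highest score first


def nra_top_k(lists: List[InvertedList], k: int) -> Tuple[List[str], int]:
    """Return (top-k items, number of sequential accesses performed)."""
    m = len(lists)
    cursors = [0] * m                   # next position to read in each list
    last_seen = [float("inf")] * m      # score under each cursor; inf before the first read
    partial: Dict[str, Dict[int, float]] = {}  # item -> {list index: score read so far}
    accesses = 0

    def worst(item: str) -> float:
        # W(r): lists where the item is still unseen contribute 0.
        return sum(partial[item].values())

    def best(item: str) -> float:
        # B(r): lists where the item is still unseen contribute their cursor score.
        return sum(partial[item].get(i, last_seen[i]) for i in range(m))

    def ranked() -> List[str]:
        # Sort seen items on W(r), breaking ties by B(r).
        return sorted(partial, key=lambda r: (-worst(r), -best(r)))

    while any(cursors[i] < len(lists[i]) for i in range(m)):
        for i in range(m):              # round-robin over the lists
            if cursors[i] >= len(lists[i]):
                continue
            item, score = lists[i][cursors[i]]
            cursors[i] += 1
            accesses += 1
            last_seen[i] = score
            partial.setdefault(item, {})[i] = score

            order = ranked()
            buffer, others = order[:k], order[k:]
            if len(buffer) < k:
                continue                # the buffer is not complete yet
            mink = worst(buffer[-1])    # smallest worst-case score in the buffer
            unseen_bound = sum(last_seen)  # best case of an item not seen anywhere yet
            if unseen_bound <= mink and all(best(r) <= mink for r in others):
                return buffer, accesses
    return ranked()[:k], accesses       # every list was read to the end


jacob = [("r1", 1.5), ("r2", 1.5), ("r3", 1.25), ("r4", 1.0), ("r5", 1.0),
         ("r6", 0.8), ("r7", 0.75), ("r8", 0.75), ("r9", 0.5), ("r10", 0.5)]
julia = [("r7", 5.0), ("r3", 4.5), ("r4", 4.5), ("r2", 4.25), ("r1", 3.5),
         ("r9", 3.0), ("r5", 2.0), ("r6", 1.0), ("r8", 0.5), ("r10", 0.5)]

top, sequential_accesses = nra_top_k([jacob, julia], k=3)
print(sorted(top), sequential_accesses)
```

Under these assumptions, the call returns r2, r3 and r7 after 13 sequential accesses, without reading the last three entries of Jacob's list or the last four of Julia's.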
NRA example (step 1)
• We initialize cursors at the head of each list. We assume $k = 3$ (hence the buffer size) and we have space to keep track of 10 candidates.

| rank | Jacob's inverted list: movie | predicted rating | Julia's inverted list: movie | predicted rating |
|---|---|---|---|---|
| 1 | r1 | 1.5 | r7 | 5 |
| 2 | r2 | 1.5 | r3 | 4.5 |
| 3 | r3 | 1.25 | r4 | 4.5 |
| 4 | r4 | 1 | r2 | 4.25 |
| 5 | r5 | 1 | r1 | 3.5 |
| 6 | r6 | 0.8 | r9 | 3 |
| 7 | r7 | 0.75 | r5 | 2 |
| 8 | r8 | 0.75 | r6 | 1 |
| 9 | r9 | 0.5 | r8 | 0.5 |
| 10 | r10 | 0.5 | r10 | 0.5 |

Buffer (positions 1-3) and candidates (positions 4-10): empty.
0 SAs have been performed so far.
$min_k$ = ?

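An inverted list of this kind is simply a user's predicted ratings sorted in decreasing order. A minimal sketch, assuming the predictions are available as a dictionary (the function name `build_inverted_list` is introduced here for illustration; the ratings are Julia's, as listed on the slide):

```python
from typing import Dict, List, Tuple


def build_inverted_list(predictions: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort (movie, predicted rating) pairs by decreasing rating.

    Python's sort is stable, so movies with equal ratings keep their input order.
    """
    return sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)


julia_predictions = {
    "r1": 3.5, "r2": 4.25, "r3": 4.5, "r4": 4.5, "r5": 2.0,
    "r6": 1.0, "r7": 5.0, "r8": 0.5, "r9": 3.0, "r10": 0.5,
}
julia_list = build_inverted_list(julia_predictions)
print(julia_list[:3])
```

The sort is paid once, offline; this is the space that is "thrown at the problem" so that query-time processing can stop early.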
NRA example (step 2)
• We move the cursors sequentially.
• We complete the buffer by adding movies r7, r1, and r2 to it.
3 SAs have been performed so far.

| position | set | movie | worst-case score | best-case score |
|---|---|---|---|---|
| 1 | Buffer | r7 | 5 | 6.5 |
| 2 | Buffer | r1 | 1.5 | 6.5 |
| 3 | Buffer | r2 | 1.5 | 6 |
| 4-10 | Candidates | (empty) | | |

$min_k$ = 1.5

NRA example (step 3)
• Once the buffer is complete, we check whether a new movie is worth adding to the buffer.
4 SAs have been performed so far.

| position | set | movie | worst-case score | best-case score |
|---|---|---|---|---|
| 1 | Buffer | r7 | 5 | 6.5 |
| 2 | Buffer | r1 | 1.5 | 6.5 |
| 3 | Buffer | r2 | 1.5 | 6 |
| 4-10 | Candidates | (empty) | | |

$min_k$ = 1.5
Given worstcase(r3) = 4.5 and bestcase(r3) = 6, should it be added to the buffer? Given that worstcase(r3) > $min_k$, then YES!

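The two bounds for r3 can be reproduced with a small sketch. It assumes, as the numbers on the slide suggest, that the score of a movie is the sum of its ratings over the lists, that an unseen rating counts as 0 in the worst case, and that it counts as the rating currently under the cursor of that list in the best case. The function name `score_bounds` and the list indexing (0 for Jacob, 1 for Julia) are introduced here for illustration.

```python
from typing import Dict, List, Tuple


def score_bounds(seen: Dict[int, float], cursor_scores: List[float]) -> Tuple[float, float]:
    """Return (worst-case, best-case) score of one movie.

    seen maps a list index to the rating already read for the movie in that list;
    cursor_scores[i] is the rating currently under the cursor of list i.
    """
    worst = sum(seen.values())
    best = sum(seen.get(i, cursor_scores[i]) for i in range(len(cursor_scores)))
    return worst, best


# After 4 SAs: Jacob's cursor is on r2 (1.5), Julia's cursor is on r3 (4.5).
# r3 has only been read in Julia's list (index 1) so far.
print(score_bounds({1: 4.5}, [1.5, 4.5]))
```

Once a movie has been read in both lists, the two bounds coincide and its score is exact.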
NRA example (step 4)
• Some items will gradually transition from the buffer to the candidates (e.g., r2).
5 SAs have been performed so far.

| position | set | movie | worst-case score | best-case score |
|---|---|---|---|---|
| 1 | Buffer | r7 | 5 | 6.5 |
| 2 | Buffer | r3 | 4.5 | 6 |
| 3 | Buffer | r1 | 1.5 | 6.5 |
| 4 | Candidates | r2 | 1.5 | 6 |

$min_k$ gets updated but stays at 1.5.

NRA example (step 5)
• We have to check the stopping condition after each SA.
• We stop if max(bestcase(candidates)) < $min_k$.
5 SAs have been performed so far.

| position | set | movie | worst-case score | best-case score |
|---|---|---|---|---|
| 1 | Buffer | r3 | 5.75 | 5.75 |
| 2 | Buffer | r7 | 5 | 6.5 |
| 3 | Buffer | r1 | 1.5 | 6.5 |
| 4 | Candidates | r2 | 1.5 | 6 |

$min_k$ = 1.5
⚠ Should we stop? 6 ≮ 1.5, then NO!

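The stopping test is a one-line comparison. A minimal sketch, where the function name `should_stop` is introduced here and the candidates are assumed to be given as a dictionary from movie to best-case score:

```python
from typing import Dict


def should_stop(candidate_best: Dict[str, float], mink: float) -> bool:
    """Stop when no candidate can still exceed the weakest buffer entry."""
    # With no candidates at all, the condition holds trivially (assumption of this sketch).
    return max(candidate_best.values(), default=float("-inf")) < mink


print(should_stop({"r2": 6.0}, mink=1.5))            # the situation on this slide
print(should_stop({"x": 5.5, "y": 5.0}, mink=5.75))  # hypothetical values
```

The first call prints `False`, matching the "NO!" on the slide; the second call uses made-up numbers to show a case where the scan can end.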
NRA example (step 6)
• For any new movie, we check if it should be added to the buffer. After the buffer update, we check the stopping condition.
6 SAs have been performed so far.

| position | set | movie | worst-case score | best-case score |
|---|---|---|---|---|
| 1 | Buffer | r3 | 5.75 | 5.75 |
| 2 | Buffer | r7 | 5 | 6.5 |
| 3 | Buffer | r1 | 1.5 | 6.5 |
| 4 | Candidates | r2 | 1.5 | 6 |

$min_k$ = 1.5
⚠ Should we stop? 6 ≮ 1.5, then NO!
Given worstcase(r4) = 4.5 and bestcase(r4) = 5.75, should it be added to the buffer? Given that worstcase(r4) > $min_k$, then YES!

NRA example (step 7)
• We update $min_k$ after any buffer update.
6 SAs have been performed so far.

| position | set | movie | worst-case score | best-case score |
|---|---|---|---|---|
| 1 | Buffer | r3 | 5.75 | 5.75 |
| 2 | Buffer | r7 | 5 | 6.5 |
| 3 | Buffer | r4 | 4.5 | 5.75 |
| 4 | Candidates | r1 | 1.5 | 6.5 |
| 5 | Candidates | r2 | 1.5 | 6 |

$min_k$ = 4.5 (updated from 1.5)
⚠ Should we stop? 6.5 ≮ 4.5, then NO!

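The buffer update and the new value of $min_k$ can be sketched as a single re-ranking step. The function name `update_buffer` and the dictionary layout are assumptions made here; the bounds are the ones shown in the table of this step, and at least one movie is assumed to have been seen.

```python
from typing import Dict, List, Tuple


def update_buffer(
    bounds: Dict[str, Tuple[float, float]], k: int
) -> Tuple[List[str], List[str], float]:
    """Split seen movies into buffer and candidates and return the new mink.

    bounds maps a movie to its (worst-case, best-case) scores.
    """
    # Rank on worst-case score, breaking ties by best-case score.
    ranked = sorted(bounds, key=lambda r: (-bounds[r][0], -bounds[r][1]))
    buffer, others = ranked[:k], ranked[k:]
    mink = min(bounds[r][0] for r in buffer)
    # A movie outside the buffer stays a candidate only if it can still beat mink.
    candidates = [r for r in others if bounds[r][1] > mink]
    return buffer, candidates, mink


step7_bounds = {
    "r3": (5.75, 5.75), "r7": (5.0, 6.5), "r1": (1.5, 6.5),
    "r2": (1.5, 6.0), "r4": (4.5, 5.75),
}
print(update_buffer(step7_bounds, k=3))
```

With these inputs, r4 displaces r1 from the buffer and $min_k$ rises from 1.5 to 4.5, as on the slide.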
NRA example (step 8)
• We move the cursors sequentially.
7 SAs have been performed so far.

| position | set | movie | worst-case score | best-case score |
|---|---|---|---|---|
| 1 | Buffer | r3 | 5.75 | 5.75 |
| 2 | Buffer | r4 | 5.5 | 5.5 |
| 3 | Buffer | r7 | 5 | 6.5 |
| 4 | Candidates | r1 | 1.5 | 6.5 |
| 5 | Candidates | r2 | 1.5 | 6 |

$min_k$ = 5
⚠ Should we stop? 6.5 ≮ 5, then NO!

NRA example (step 9)
• We move the cursors sequentially.
8 SAs have been performed so far.

| position | set | movie | worst-case score | best-case score |
|---|---|---|---|---|
| 1 | Buffer | r3 | 5.75 | 5.75 |
| 2 | Buffer | r2 | 5.75 | 5.75 |
| 3 | Buffer | r4 | 5.5 | 5.5 |
| 4 | Candidates | r7 | 5 | 6.5 |
| 5 | Candidates | r1 | 1.5 | 6.5 |

$min_k$ = 5.5
⚠ Should we stop? 6.5 ≮ 5.5, then NO!

NRA example (step 10)
• Some movies are worth adding neither to the buffer nor to the candidates.
9 SAs have been performed so far.

| position | set | movie | worst-case score | best-case score |
|---|---|---|---|---|
| 1 | Buffer | r3 | 5.75 | 5.75 |
| 2 | Buffer | r2 | 5.75 | 5.75 |
| 3 | Buffer | r4 | 5.5 | 5.5 |
| 4 | Candidates | r7 | 5 | 6.5 |
| 5 | Candidates | r1 | 1.5 | 6.5 |

$min_k$ = 5.5
⚠ Should we stop? 6.5 ≮ 5.5, then NO!
Given worstcase(r5) = 1 and bestcase(r5) = 5.25, should it be added to the buffer? Given that worstcase(r5) ≯ $min_k$, then NO!
Can we still keep it as a candidate? Given that bestcase(r5) ≯ $min_k$, then NO!

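The two questions asked for r5 amount to a three-way decision on each newly seen movie. A minimal sketch, where the function name `place` and the returned labels are introduced here for illustration:

```python
def place(worst: float, best: float, mink: float) -> str:
    """Decide where a movie goes, given its bounds and the current mink."""
    if worst > mink:
        return "buffer"       # already guaranteed to beat the weakest buffer entry
    if best > mink:
        return "candidates"   # could still beat it once more ratings are read
    return "discard"          # can no longer reach the top-k


print(place(worst=1.0, best=5.25, mink=5.5))  # r5 on this slide
print(place(worst=5.0, best=6.5, mink=5.5))   # r7, kept as a candidate
print(place(worst=4.5, best=6.0, mink=1.5))   # r3 when it was first seen
```

Discarding r5 is safe because its best-case score is an upper bound on its final score, and that bound already lies below $min_k$.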
NRA example (step 11)
• We move the cursors sequentially.
10 SAs have been performed so far.

| position | set | movie | worst-case score | best-case score |
|---|---|---|---|---|
| 1 | Buffer | r3 | 5.75 | 5.75 |
| 2 | Buffer | r2 | 5.75 | 5.75 |
| 3 | Buffer | r4 | 5.5 | 5.5 |
| 4 | Candidates | r7 | 5 | 6.5 |
| 5 | Candidates | r1 | 5 | 5 |

$min_k$ = 5.5
⚠ Should we stop? 6.5 ≮ 5.5, then NO!

NRA example (step 12)
• We move the cursors sequentially.
11 SAs have been performed so far.

| position | set | movie | worst-case score | best-case score |
|---|---|---|---|---|
| 1 | Buffer | r3 | 5.75 | 5.75 |
| 2 | Buffer | r2 | 5.75 | 5.75 |
| 3 | Buffer | r4 | 5.5 | 5.5 |
| 4 | Candidates | r7 | 5 | 6.5 |
| 5 | Candidates | r1 | 5 | 5 |

$min_k$ = 5.5
⚠ Should we stop? 6.5 ≮ 5.5, then NO!
Given worstcase(r6) = 0.8 and bestcase(r6) = 4.3, should it be added to the buffer? Given that worstcase(r6) ≯ $min_k$, then NO!
Can we still keep it as a candidate? Given that bestcase(r6) ≯ $min_k$, then NO!

NRA example (step 13)
• We move the cursors sequentially.
12 SAs have been performed so far.

| position | set | movie | worst-case score | best-case score |
|---|---|---|---|---|
| 1 | Buffer | r3 | 5.75 | 5.75 |
| 2 | Buffer | r2 | 5.75 | 5.75 |
| 3 | Buffer | r4 | 5.5 | 5.5 |
| 4 | Candidates | r7 | 5 | 6.5 |
| 5 | Candidates | r1 | 5 | 5 |

$min_k$ = 5.5
⚠ Should we stop? 6.5 ≮ 5.5, then NO!
We add movie r9 neither to the buffer nor to the candidates: its worst-case score is 3 and its best-case score is 3 + 0.8 = 3.8, and neither exceeds $min_k$.

NRA example (step 14)
• We move the cursors sequentially.
13 SAs have been performed so far.

| position | set | movie | worst-case score | best-case score |
|---|---|---|---|---|
| 1 | Buffer | r3 | 5.75 | 5.75 |
| 2 | Buffer | r2 | 5.75 | 5.75 |
| 3 | Buffer | r4 | 5.5 | 5.5 |
| 4 | Candidates | r7 | 5 | 6.5 |
| 5 | Candidates | r1 | 5 | 5 |

$min_k$ = 5.5
⚠ Should we stop? 6.5 ≮ 5.5, then NO!
Exploratory Analysis of User Data

  • 1. Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event Exploratory Analysis of User Data Behrooz Omidvar-Tehrani Research Scientist at Grenoble AI Institute http://www.omidvar.info Intensive course in RAIS summer school, 17-19 May 2021
  • 2. Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event • Behrooz Omidvar-Tehrani, PhD in Computer Science and Applied Mathematics • Research focus on interactive data analysis, at the crossroads of machine learning, data science, and data mining. About the instructor 2 Postdoctoral Researcher at The Ohio State University 2016-2017 Postdoctoral Researcher at The Grenoble Alpes University 2017-2018 Research Scientist at Naver Labs Europe 2019-2020 Research Scientist at Grenoble AI Institute 2021-Present
  • 3. Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event Why user data? • Because user data is ubiquitous. • Users are very active on the Web, generating user data. • Here is what has happened in the last 5 minutes on the Web (per http://pennystocks.la/internet-in-real-time): 3 3M new tweets posted on Twitter 24M videos watched on YouTube 274K photos uploaded to Instagram 8M photos liked on Instagram 22M searches performed on Google 16M posts added to Facebook 12M messages sent on WhatsApp 51K video hours watched on Netflix 1M users participated in a Zoom call
  • 4. Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event Hunger for user data • The number of requests to obtain user data has increased drastically. • Google received 48,941 government data requests affecting 83,345 user accounts in the first six months of 2017. The United States issued 16,823 of these requests. • Dataset Search indexes almost 25 million user datasets. (https://blog.google/products/search/discovering-millions-datasets-web/) 4
  • 5. Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event Why analyze user data? • In general, data analysis means to “collect data” and “provide insights”. • User data analysis means extracting value from user data → behavioral analytics • It unveils insights into the behavior of customers. 5 Netflix movie recommendation © UX Collective Amazon product recommendation © MagePlaza Analytical dashboards for business insights © Marketing Land Automated medical analysis © 123 RF [Omidvar-Tehrani and Amer-Yahia, TKDE’19]
  • 6. Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event • User data is voluminous and noisy, hence hard to get insights from. • Often an analysis pipeline is designed to tackle the challenges of volume and noise. • We often refer to it by its abbreviated form, the UDA pipeline. • Why post-processing? Because mined results and recommendations need to be rendered in a human-understandable form. • Why user data presentation? When digesting the insights, the human brain performs better on visual elements than on textual information. • Why user data exploration? An exhaustive scan through all discovered groups is not possible for users. User data analysis pipeline 6 Raw user data User Data Preparation towards less noise towards less volume User Data Mining, Learning, and Recommendation post-processing User Data Presentation User Data Exploration interaction User [Omidvar-Tehrani, Amer-Yahia, Simon @ HILDA’19]
  • 7. Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event User roles in UDA pipelines • Users with different roles and needs write UDA pipelines to achieve tasks. 7 Data scientist Domain expert Information consumer who brings analysis expertise who brings domain knowledge who brings task
  • 8. Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event Objectives and the timeline of the course 8 Objectives • Motivate UDA and UDA pipelines and illustrate its importance in practice • Understand the underlying structure of user data in its general form • Walk through the UDA pipelines and discuss its components, from preparation to exploration • Work on hands-on experiences to observe the challenges of UDA implementation in practice • Get familiar with the state of the art in UDA research Timeline • Session 1. Monday 17 May 2021 at 10:30 - 12:30 (Introduction, User Data Preparation and Visualization) • Session 2. Tuesday 18 May 2021 at 10:30 - 12:30 (User Data Mining and Recommendation) • Session 3. Wednesday 19 May 2021 at 10:30 - 12:30 (User Data Exploration with Reinforcement Learning)
  • 13. Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event Topics covered in the course 9 Raw user data User Data Preparation towards less noise towards less volume User Data Mining, Learning, and Recommendation post-processing User Data Presentation User Data Exploration interaction User SESSION 1 SESSION 2 SESSION 1 SESSION 3 What is the general model behind all user datasets? How to prepare user data for analysis? How to increase the quality of user data? How to discover (mine) insights in user data? How to build a recommender engine for user data? How to recommend to a group of users? How to make sense out of user data? How to discuss user data with collaborators? How to build interactive user data analysis systems? How to learn interactions with user data? How to guide users in labor-intensive tasks?
  • 14. Exploratory Analysis of User Data: 1st RAIS Summer School May 2021, Online Event This course is interactive. You participate in 10 polls throughout the course. Course material 10 Hands-on experiences Some code templates will be delivered at the end of each session to practice the learned material. Course slides Available at http://www.omidvar.info/#activities (“teaching” section) Questions Please use during the sessions. For all other questions, email me at behrooz@omidvar.info.
  • 15. About exercises. Hands-on #1: Research paper finder (practicing data crawling and data collection; requirement: Python). Hands-on #2: D3 histogram (practicing user data visualization; requirement: JavaScript and HTML). Hands-on #3: Mining user groups (practicing user data mining and itemset mining; requirement: Python, basic C, basic cmd). Hands-on #4: Multi-objective mining (practicing multi-objective optimization; requirement: Java). Hands-on #5: Recommendation (practicing recommendation algorithms; requirement: Python). Hands-on #6: Implementing exploration semantics (practicing data / problem modeling; requirement: Math and Logic). Hands-on #7: Designing a Markov Decision Process (practicing Markov Decision Processes; requirement: Math and Logic). Hands-on #8: RL for Exploratory User Data Analysis (practicing reinforcement learning; requirement: Python).
  • 16. Poll: Prioritizing actions in user data analysis. Question: You are a data scientist in a company owning terabytes of user data. They ask you to deliver some good insights about their data, but they don't have any specific questions to ask (or any hypotheses to form). They give you only one week to deliver results. How do you prioritize your actions? Popular answers: (A1) I start cleaning the data, build a visualization dashboard, and present some insights using the dashboard. (A2) I prepare the data for exploration and ask the data owners to navigate the data and evaluate some hypotheses. (A3) I don't start the implementation; I first think on paper for a bit in order to come up with a good pipeline plan. (A4) I start performing some predictions on the raw data, followed by some post-processing steps. (A5) I perform some mining on the raw data, followed by some post-processing steps. Votes: A1 35%, A2 30%, A3 25%, A4 5%, A5 5%.
  • 17. SESSION 1: User Data Preparation and Visualization.
  • 18. User data model. User data is a (complex) bipartite graph between the set of users $\mathcal{U}$ and the set of items $\mathcal{I}$. Attributes $\mathcal{A}$ describe both users and items. Users $\mathcal{U}$ carry demographics (gender, age, occupation, location, health status). Items $\mathcal{I}$ can be a movie, medicine, grocery, music, book, tweet, or action. Users and items are connected by temporal actions. [Omidvar-Tehrani, Amer-Yahia @ TKDE'20]
  • 19. Links between users. Users are not independent entities: they are connected through social links. Social links can be explicit (friendship on Facebook, following on Twitter, co-authorship) or implicit (like-minded users). Explicit example: Mary and John are linked through their friendship on Facebook. Mary is a female engineer. John is a male student. Implicit example: Elena and Amber are linked through their interest in drama-genre movies. Elena is a female professor. Amber is a female pianist. Elena likes The Godfather (Crime, Drama). Amber likes Titanic (Romance, Drama).
  • 20. Simple data structure but rich value. The simple bipartite structure of user data contains many pieces of useful information. Running example: Amber is a female pianist. Amber likes Titanic. Item attributes: Titanic was produced in 1997 by James Cameron, starring Leonardo DiCaprio and Kate Winslet. Action attributes: Amber liked the movie Titanic on 17 May 2021, at 3365 Indiana Street, San Diego, USA. User groups: Amber belongs to the group of female pianists in California, which has 34K members. Abstract user groups: Amber also belongs to the group of females, the group of pianists, the group of Californians, and the group of Titanic lovers. Abstract user attributes: Amber is also an artist.
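A minimal sketch of the bipartite model above, to make the roles of users, items, attributes, and actions concrete. The `UserData` class, its field names, and the `shared_genres` helper are hypothetical illustrations, not part of the course material; the sample records reuse the Amber and Elena examples from the slides.

```python
from dataclasses import dataclass, field

@dataclass
class UserData:
    """Bipartite user data: users and items are nodes, actions are edges."""
    user_attrs: dict = field(default_factory=dict)  # user id -> attribute dict
    item_attrs: dict = field(default_factory=dict)  # item id -> attribute dict
    actions: list = field(default_factory=list)     # (user, item, action attributes)

    def add_action(self, user, item, **attrs):
        self.actions.append((user, item, attrs))

    def items_of(self, user):
        return {i for u, i, _ in self.actions if u == user}

    def users_of(self, item):
        return {u for u, i, _ in self.actions if i == item}

data = UserData()
data.user_attrs["Amber"] = {"gender": "female", "occupation": "pianist"}
data.user_attrs["Elena"] = {"gender": "female", "occupation": "professor"}
data.item_attrs["Titanic"] = {"genres": {"Romance", "Drama"}}
data.item_attrs["The Godfather"] = {"genres": {"Crime", "Drama"}}
data.add_action("Amber", "Titanic", kind="like", date="2021-05-17")
data.add_action("Elena", "The Godfather", kind="like")

def shared_genres(data, u1, u2):
    """Genres liked by both users: one way to surface an implicit link."""
    def genres(user):
        return set().union(*(data.item_attrs[i]["genres"] for i in data.items_of(user)))
    return genres(u1) & genres(u2)

print(shared_genres(data, "Amber", "Elena"))
```

Explicit links (such as friendship) would need an extra user-to-user edge list; the sketch only covers the user-item side.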
  • 21. User data preparation. User data preparation is the process of preparing (raw) user data for UDA. Its outcome is another version of the user data with less noise. Steps shown: Extract, Transform, Load (ETL); user data ingestion; user data integration; user data cleaning; user data post-processing (augmentation, delivery).
  • 22. Extract, Transform, Load (ETL). The first step in user data preparation is called ETL. Extract: pulling user data from a source is the first phase of ETL. The literature often places the "ingestion" and "integration" steps inside this first phase. Transform: an intermediate phase that applies a set of rules and pre-defined functions to prepare the data for loading. The literature often treats "data cleaning" as a component of this phase. Load: the last phase, which places the data in the hosting structure, such as a relational or NoSQL database. Where to obtain (public) user data? Collect user data using Amazon Mechanical Turk, SurveyMonkey, and similar platforms. Crawl user data using BeautifulSoup and similar libraries; this process is also called web scraping. Download the data from dataset repositories, e.g., UCI, Kaggle, GitHub, Google Dataset Search, Harvard Dataverse.
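The three ETL phases can be sketched in a few lines. This is an illustration under stated assumptions, not the course's pipeline: the source is a small in-memory CSV, the hosting structure is an in-memory SQLite database, and the column names, sample rows, and the `views` table are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would be a downloaded or crawled file.
RAW = """user,age,movie
u1, 34 ,Titanic
u2,,The Pianist
u3,27,titanic
"""

def extract(text):
    """Extract: parse raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace, normalize titles, turn empty ages into None."""
    cleaned = []
    for row in rows:
        age = row["age"].strip()
        cleaned.append((row["user"].strip(),
                        int(age) if age else None,
                        row["movie"].strip().title()))
    return cleaned

def load(rows, conn):
    """Load: store the prepared rows in a relational table."""
    conn.execute("CREATE TABLE IF NOT EXISTS views (user_id TEXT, age INTEGER, movie TEXT)")
    conn.executemany("INSERT INTO views VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
titanic_viewers = conn.execute(
    "SELECT COUNT(*) FROM views WHERE movie = 'Titanic'").fetchone()[0]
missing_ages = conn.execute(
    "SELECT COUNT(*) FROM views WHERE age IS NULL").fetchone()[0]
print(titanic_viewers, missing_ages)
```

Keeping each phase in its own function makes it easy to swap the source (a crawler, an API) or the target (a NoSQL store) without touching the other phases.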
  • 23. Data acquisition using crawling. We crawl data when no direct and easy access to the data in question is available. Before crawling, always check for copyright issues, and note that some websites offer their own APIs. Webpages with regular structure are the best candidates for crawling. Beautiful Soup is a Python library for pulling data out of HTML (https://www.crummy.com/software/BeautifulSoup/bs4/doc/). Example (Python 3; prints the SIGMOD 2020 paper titles on DBLP that contain a keyword):
      import urllib.request
      from bs4 import BeautifulSoup

      url_template = "https://dblp.org/db/conf/sigmod/sigmod2020.html"
      keywords = ["user data"]

      # Download and parse the proceedings page.
      with urllib.request.urlopen(url_template) as page:
          soup = BeautifulSoup(page, "html.parser")

      # Each paper title sits in a <span class="title"> element.
      papers = soup.find_all("span", {"class": "title"})
      for paper in papers:
          paper_str = paper.text
          # Print the title once if it contains any of the keywords.
          if any(keyword in paper_str for keyword in keywords):
              print(paper_str)
  • 24. Hands-on 1: Research paper finder. Task: write Python code that automatically finds all research papers (and their authors) about a given set of keywords $\mathcal{W}$, where $\mathcal{W}$ is an input parameter. Download the Python code paper-finder.py from the following link and complete it: https://drive.google.com/drive/folders/1M-HlNao9tYwqN0imeZ-SzHnGZKMoJgh4?usp=sharing. Missing parts are marked with a TODO comment. The DM Authors dataset was built in the same way. It is available on the PerSCiDO platform via https://doi.org/10.18709/perscido.2016.10.ds32 [Omidvar-Tehrani, Amer-Yahia, Termier @ CIKM'15]
  • 25. Advanced data collection. Nowadays most web pages are highly dynamic, and such dynamic content is harder to crawl. ScrapingBee is a web scraping service built on headless browsing. It emulates human behavior so that websites don't block the crawling process. Selenium is an open-source project for browser automation. The following code crawls a webpage protected by a login (written for Selenium 4; DRIVER_PATH, USERNAME, and PASSWORD are placeholders to define beforehand):
      from selenium import webdriver
      from selenium.webdriver.chrome.options import Options
      from selenium.webdriver.chrome.service import Service
      from selenium.webdriver.common.by import By

      options = Options()
      options.add_argument("--headless")  # run Chrome without a window
      driver = webdriver.Chrome(service=Service(DRIVER_PATH), options=options)

      driver.get("https://news.ycombinator.com/login")
      # Fill in the login form, then submit it.
      driver.find_element(By.XPATH, "//input[@type='text']").send_keys(USERNAME)
      driver.find_element(By.XPATH, "//input[@type='password']").send_keys(PASSWORD)
      driver.find_element(By.XPATH, "//input[@value='login']").click()
      print(driver.page_source)  # page content after logging in
      driver.quit()
  • 26. User data cleaning. Data cleaning is the process of detecting and removing noise in data. The cleanliness of data can be evaluated using different measures such as validity, accuracy, completeness, consistency, and uniformity. User data cleaning techniques: dealing with missing values; dealing with outliers; data improvement; data tidy-up; scaling.
  • 27. User data cleaning techniques: missing values. Missing values are considered noise. In user datasets, many attribute values are missing (e.g., gender, occupation, visitation date). When data is missing, we either drop it or impute it. Dropping is usually performed using a threshold. Imputation preserves the data size and is therefore often preferable to dropping. Numerical imputation: use a default value for the missing data, for instance 0 in place of None. The median is another value to consider (why not the average?). Categorical imputation: replace the missing values with the most frequent value in the column, or otherwise use "other". Dropping with a threshold (data is a pandas DataFrame):
      threshold = 0.7
      # Keep only columns whose missing-value rate is below the threshold.
      data = data[data.columns[data.isnull().mean() < threshold]]
      # Keep only rows whose missing-value rate is below the threshold.
      data = data.loc[data.isnull().mean(axis=1) < threshold]
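The two imputation strategies can be sketched as follows. The example is written in plain Python so that it runs without pandas (in pandas, `fillna` plays the same role); the function names and the sample values are hypothetical. The sample also hints at the answer to "why not the average?": a single extreme value shifts the mean far more than the median.

```python
from collections import Counter
from statistics import mean, median

def impute_numerical(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def impute_categorical(values, default="other"):
    """Replace None with the most frequent value, or `default` if nothing is observed."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0] if observed else default
    return [fill if v is None else v for v in values]

ages = [25, None, 31, 28, 90]            # 90 is an extreme value
jobs = ["student", None, "engineer", "student", None]

ages_filled = impute_numerical(ages)
jobs_filled = impute_categorical(jobs)
observed_mean = mean(v for v in ages if v is not None)

print(ages_filled)    # the gap is filled with the median of 25, 28, 31, 90
print(observed_mean)  # the mean is pulled upward by the value 90
print(jobs_filled)
```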
  • 28. User data cleaning techniques: outliers. Outliers are considered potential noise. An outlier is a piece of data that deviates markedly from the rest. Methods for outlier detection are visualization (the most effective method), standard deviation, and percentiles. Standard deviation: if the distance between a value and the average is larger than X times the standard deviation, the value can be treated as an outlier. Percentiles: a certain percentage of the values at the top or the bottom can be treated as outliers. Outlier values can be either dropped or capped. As with missing data, dropping doesn't maintain the data size, while capping does. Figure question: Is Brazil an outlier? What about Burundi?
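A small sketch of the standard-deviation and percentile rules, together with dropping and capping. It uses only the Python standard library; the function names, the sample scores, the multiplier 2, and the 10%/90% cut-offs are illustrative assumptions. The percentile bounds use a simple nearest-rank convention, and other conventions exist.

```python
from statistics import mean, pstdev

def std_outliers(values, x=2.0):
    """Values farther than x standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) > x * sigma]

def percentile_bounds(values, lower=0.10, upper=0.90):
    """Nearest-rank lower and upper percentile values."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[round(lower * last)], ordered[round(upper * last)]

def cap(values, low, high):
    """Capping clips out-of-range values, so the data size is preserved."""
    return [min(max(v, low), high) for v in values]

def drop(values, low, high):
    """Dropping removes out-of-range values, so the data shrinks."""
    return [v for v in values if low <= v <= high]

scores = [12, 14, 15, 13, 16, 14, 15, 13, 14, 95]

flagged = std_outliers(scores, x=2.0)
low, high = percentile_bounds(scores)
capped = cap(scores, low, high)
dropped = drop(scores, low, high)
print(flagged, (low, high), len(capped), len(dropped))
```

With only ten values, a multiplier of 3 could never flag anything (the largest possible deviation is below three population standard deviations), which is one reason the multiplier X has to be chosen with the sample size in mind.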
  • 29. User data cleaning techniques: data improvement. Data cleaning is not only about reducing noise, but also about increasing the utility of user data. Examples of data improvement techniques are binning and log transform. Figures: percentage binning; log transform.
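Both improvement techniques fit in a few lines. This is a sketch under assumptions: the bin boundaries (33 and 66) and the labels are arbitrary choices for illustration, and the log transform uses log(1 + x) so that zero stays defined.

```python
import math

def bin_percentage(p):
    """Map a percentage in [0, 100] to a coarse label (cut points are illustrative)."""
    if p < 33:
        return "low"
    if p < 66:
        return "medium"
    return "high"

def log_transform(values):
    """log(1 + x) compresses large values; inputs must be non-negative."""
    return [math.log1p(v) for v in values]

completion = [5, 40, 92]
visits = [0, 9, 99, 9999]

binned = [bin_percentage(p) for p in completion]
logged = log_transform(visits)
print(binned)
print([round(v, 2) for v in logged])
```

Binning trades precision for robustness against small fluctuations, while the log transform reduces the influence of heavy-tailed values such as visit counts.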
  • 30. User data cleaning techniques: data tidy-up. A user dataset is called tidy iff every row represents a user and every column represents a feature. Tidy datasets are easy to manipulate, model, and visualize. Grouping is the process of making an un-tidy dataset tidy. Common grouping operations are average, sum, and concatenation. Is ungrouping (tidy to un-tidy) necessary too? Transaction user dataset (un-tidy), as (user, score) pairs: (u1, 65), (u3, 60), (u2, 14), (u2, 30), (u1, 32), (u1, 90). Tidy user dataset after grouping, as (user, average score) pairs: (u1, 62.33), (u2, 22), (u3, 60).
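The grouping step on this slide can be reproduced with a short sketch. It uses the standard library only (in pandas the same operation would be a `groupby`); the `group` function name is hypothetical, and the data is the transaction table from the slide.

```python
from collections import defaultdict
from statistics import mean

# Un-tidy transaction dataset from the slide: one row per (user, score) event.
transactions = [("u1", 65), ("u3", 60), ("u2", 14), ("u2", 30), ("u1", 32), ("u1", 90)]

def group(rows, aggregate=mean):
    """Collapse all rows of a user into one value, giving one row per user."""
    per_user = defaultdict(list)
    for user, score in rows:
        per_user[user].append(score)
    return {user: aggregate(scores) for user, scores in sorted(per_user.items())}

tidy_average = group(transactions)       # average score per user
tidy_sum = group(transactions, sum)      # another common grouping operation
print({user: round(score, 2) for user, score in tidy_average.items()})
```

Passing the aggregation as a parameter covers the other grouping operations named on the slide, such as sum or concatenation.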
  • 31. Data cleaning frameworks. A framework by Michael Stonebraker (ACM Turing Award winner), focusing on data mastering and unification. Apple inductiv, by Christopher Ré, Ihab Ilyas, and Theodoros Rekatsinas, focusing on employing artificial intelligence to automate the task of identifying and correcting errors in data. A framework by the same leaders as inductiv, focusing on providing a machine learning system for data repair and predictions on structured data. OpenCloud, by NYU Data Science, focusing on providing a Python library for data preprocessing and cleaning. A framework by Laure Berti-Equille, focusing on providing a Python library for data preprocessing and cleaning based on Q-Learning.
  • 32. Poll: Prioritizing data cleaning techniques. Question: You are the head of a data engineering team in a healthcare company. Their user data is entered manually by nurses and is therefore noisy: it includes many missing and possibly inaccurate values in patient information. How do you prioritize between the data cleaning techniques? Options (bar chart of votes per technique): feature split, dropping, grouping, scaling, imputation, binning, log transform.
  • 33. User data visualization. Visualization is sensemaking of user data using visual variables. A visualization component consists of three building blocks: views, visual variables, and visual elements. Visualization can be done either at the beginning or at the end of UDA pipelines, for hypothesis testing and validation, respectively. At the core of visualizing user data is a mapping function that associates user characteristics with visual variables. Figure: a visualization of the MovieLens dataset, annotated with its view, visual variables, and visual elements. [Zegarra et al., FGCS'20] [Heer and Hellerstein, VLDB'09]
  • 34. Types of visualization. User data can be visualized with typical visualization tools such as Tableau, or with more specialized approaches such as graph-based or location/time-based visualization. Categories: off-the-shelf visualization; graph-based visualization; geospatial and temporal visualization; application-dependent visualization. Examples: NodeTrix [Henry et al., TVCG'07]: user groups are shown using node-link diagrams and adjacency matrices. Bike angels [Chung et al., COMPASS'18]: dispatchers are informed about which stations should have bikes added (in blue) or removed (in red). Soccer analytics [Machado et al., CG'17]: players are users, and their actions are visualized to obtain insights. Visualization grammars [Satyanarayan et al., TVCG'17]: visual grammars facilitate creating, saving, and sharing visual analytics.
  • 36. Web-based visualization. D3.js is a JavaScript library for web-based visualization. (Why web-based?) D3 stands for Data-Driven Documents. A common starting point is the example gallery ("visualization zoo") at https://d3js.org. Co-developed by Jeffrey Heer (University of Washington). [Bostock et al., TVCG'11] Skeleton of a scatter plot:
      <div id="scatter_area"></div>
      <script src="https://d3js.org/d3.v4.js"></script>
      <script>
        var margin = …
        var svg = d3.select("#scatter_area") …
        var data = [ {x:10, y:20}, {x:40, y:90}, {x:80, y:50} ]
        var x = d3.scaleLinear() …
        var y = d3.scaleLinear() …
        svg.selectAll("whatever").data(data).enter() …
      </script>
    The x scale and its axis in full:
      var x = d3.scaleLinear()
        .domain([0, 100])
        .range([0, width]);
      svg.append('g')
        .attr("transform", "translate(0," + height + ")")
        .call(d3.axisBottom(x));
  • 37. Hands-on 2: D3 histogram. Task: we are given a CSV file containing the hours at which users logged in to a platform under investigation. Visualize a histogram of this data using D3. Download the content of the sub-folder D3-Histogram from the following link and complete it: https://drive.google.com/drive/folders/1f82RplHgLte223QoD99UIKEM3IJSV4y5?usp=sharing. Missing parts are marked with a TODO comment. Expected result: the figure shows that the peak hours were around 11AM and 5PM, and that no log-in happened in the early morning. Important: you need a local web server to run this example. You can simply use:
      $ python -m SimpleHTTPServer 8000   # Python 2
      $ python3 -m http.server 8000       # Python 3
  • 38. Crossfilter. Crossfilter is a JavaScript library focusing on fast multidimensional filtering for coordinated views. In other words, Crossfilter brings interactivity to visualizations. Source files are accessible via https://github.com/crossfilter/crossfilter. See examples at https://drarmstr.github.io/chartcollection/examples/#worldbank. [Omidvar-Tehrani et al., ICDE'17]
  • 41. Time-based visualization. Various approaches have been proposed for interactively visualizing the time-based activities of users. EventFlow is an example of leveraging the time dimension: groups of users are shown along with their temporal actions in a visual interface (https://hcil.umd.edu/eventflow/). Figure annotations: groups of patients with common treatments; length of treatments. [Monroe et al., TVCG'13]
  • 42. Takeaways from the first session. Behavioral analysis aims to extract value from user data. User data is modeled as a bipartite graph with users on one side and items on the other. The user data analysis pipeline contains user data preparation, mining and recommendation, presentation, and exploration. We often obtain user data by collecting, crawling (scraping), or downloading from dataset repositories. The main tasks in user data cleaning deal with missing values, outliers, data improvement, data tidy-up, and data scaling. At the core of visualizing user data is a mapping function that associates user characteristics with visual variables. Visualization of user evolution needs special care.
  • 43. SESSION 2: User Data Mining and Recommendation.
  • 44. Understanding user behavior. One important task in UDA is to understand user behavior. Simply put, we are interested in knowing "what users have done" by collecting their interactions with data. Understanding user behavior benefits businesses, as it helps them envision which services to expand in the future to increase both user satisfaction and revenue. Figures: Amazon product recommendation (© measuringu.com); Netflix movie recommendation (© Medium).
  • 45. User data mining and recommendation. We employ user data for two separate tasks: mining and recommendation. Mining: to understand and represent user behaviors in the captured data. A famous application in industry is cross-selling: "customers who bought this item also bought …" (http://cliintel.com/diapers-beer-and-data-in-retail/). The fundamental assumption is that there exist groups of user activities formed by like-minded users, which constitute different instances of user behavior. Hence the main action is grouping. Recommendation: to predict future user behaviors from the captured data. Recommendation is a great approach for personalization. The fundamental assumption is that there exists a latent relation in user interactions, which can also predict possible future interactions. Hence the main action is relation discovery.
  • 46. User data mining. The main action in user data mining is grouping, which is usually performed in an unsupervised setting. We need two elements to group users: a distance function and a representation approach. The distance function imposes the grouping / mining semantics: it determines whether two users should or should not be placed in a common group. It is sometimes called a similarity function. The representation approach defines how each mined group should be labeled. In the following example, majority voting is used for representation: Mia likes 60 drama movies and 40 action movies. What is her distance to the group of drama-genre lovers, and to the group of action-genre lovers?
  • 47. Myriads of grouping methods. Community and Clique Detection [Newman, Physical J.'04] [Barbieri et al., ICDM'13] [Goyal et al., CIKM'08]. Team and Tribe Formation [Nikolaev et al., KDD'16]. Segment Discovery [Amer-Yahia et al., WWW'2017] (figure: segments on IMDb, segments' rating distributions, and segment exploration with rating maps). Pattern and Cube Mining [Xin et al., KDD'06] [Kamat et al., ICDE'14]. Clustering and Partitioning [Agrawal et al., ACM'1998] [Pedreira et al., VLDB'16]. Cohort Representation [Jiang et al., VLDB'16]. [Omidvar-Tehrani, Amer-Yahia, Lakshmanan @ DSAA'18]
  • 48. Clustering: k-means. One common mining approach is clustering. K-means is the de facto standard in clustering. It is an iterative expectation-maximization (EM) approach that updates the parameters of each cluster until convergence. Cluster centroids are the representatives. K-means is a hard clustering method. K-means clusters are radial. Figure: clusterings for k=2, k=3, k=4, and k=5 (© iChrome). Pseudocode:
      input parameter k
      centroids ← k random users
      repeat until convergence:
          for all users:
              find the centroid closest to the user
              assign the user to the cluster of that centroid (expectation)
          update the centroids (maximization)
      return k centroids
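The pseudocode above can be turned into a short, dependency-free sketch. The function name `kmeans`, the fixed random seed, the Euclidean distance, and the six sample points are assumptions made for illustration; a production setting would normally rely on a library implementation.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means on tuples of numbers, following the slide's pseudocode."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # k random users as initial centroids
    clusters = []
    for _ in range(max_iter):
        # Expectation: attach every user to the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Maximization: move each centroid to the mean of its cluster.
        updated = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[c]
            for c, cluster in enumerate(clusters)
        ]
        if updated == centroids:                 # convergence: nothing moved
            break
        centroids = updated
    return centroids, clusters

users = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(users, k=2)
print(sorted(centroids))
```

Because each user ends up in exactly one cluster, the result illustrates what "hard clustering" means; the centroid is the only representation of the group.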
  • 49. Clustering: Gaussian Mixture Models (GMM). Hard clustering bears no uncertainty. In real user data, users often belong to more than one group. A generalized soft (non-hard) clustering approach is Gaussian Mixture Models (GMM). The idea is to represent each cluster with a Gaussian distribution in lieu of a centroid. Hence the whole model contains k different distributions. The objective is to maximize the fit between the data points in each cluster and its representative distribution, using maximum likelihood estimation. (Figure © Oscar Contreras Carrasco @ Towards Data Science)
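To illustrate soft membership, the following sketch fits a one-dimensional mixture of two Gaussians with the EM procedure. It is a simplified illustration and not the course's implementation: the data is one-dimensional, the initial means are supplied by hand, the variance floor of 1e-6 is an arbitrary safeguard, and all names are hypothetical.

```python
import math

def gaussian(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(xs, initial_means, n_iter=50):
    """EM for a 1-D Gaussian mixture, starting from the given initial means."""
    k = len(initial_means)
    mus = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: soft membership (responsibility) of each point in each cluster.
        resp = []
        for x in xs:
            dens = [w * gaussian(x, m, v) for w, m, v in zip(weights, mus, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each Gaussian from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)        # guard against a collapsing cluster
            weights[j] = nj / len(xs)
    return weights, mus, variances, resp

# Two groups of values plus one value (3.0) that sits between them.
xs = [0.8, 1.0, 1.1, 1.2, 3.0, 4.8, 5.0, 5.1, 5.2]
weights, mus, variances, resp = fit_gmm_1d(xs, initial_means=[0.0, 6.0])
print([round(m, 2) for m in mus])
print([round(r, 2) for r in resp[4]])            # memberships of the in-between value
```

Unlike k-means, every point receives a membership probability for every cluster, so a user who sits between two groups is not forced into one of them.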
  • 50. Clustering: DB-Scan. Density-based spatial clustering of applications with noise (DB-Scan) is a grouping method based on both a distance and a minimum number of points. The combination of the two parameters creates a notion of neighborhood. The resulting clusters are not necessarily radial. (Figure © KDnuggets) Pseudocode:
      input parameters d and nbu   // d = distance, nbu = minimum number of users
      find the users in the d-neighborhood of every user, and identify core users with more than nbu neighbors
      find the connected components of core users on the neighbor graph, ignoring all non-core users
      assign each non-core user to a nearby cluster if the cluster is a d-neighbor, otherwise mark it as noise
      return clusters
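A compact sketch of the DB-Scan procedure described above. Two conventions here are assumptions rather than facts from the slide: a user counts as its own neighbor, and a user is a core user when it has at least `nbu` neighbors (the convention scikit-learn uses for `min_samples`). The function name and sample points are hypothetical.

```python
import math

def dbscan(points, d, nbu):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    n = len(points)
    # d-neighborhood of every user (the user itself included).
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= d]
                 for i in range(n)]
    core = [len(nb) >= nbu for nb in neighbors]
    labels = [-1] * n                      # everything is noise until labeled
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Grow one connected component of core users, attaching
        # non-core users that lie in the neighborhood of a core user.
        labels[i] = cluster
        stack = [i]
        while stack:
            u = stack.pop()                # u is always a core user
            for v in neighbors[u]:
                if labels[v] == -1:
                    labels[v] = cluster
                    if core[v]:
                        stack.append(v)
        cluster += 1
    return labels

users = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9), (5, 50)]
labels = dbscan(users, d=1.5, nbu=3)
print(labels)
```

The number of clusters is not an input: it follows from `d` and `nbu`, and isolated users are reported as noise instead of being forced into a group.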
  • 51. Python libraries for clustering. Clustering algorithms can also be used as a commodity through high-level Python libraries. Among many successful libraries, scikit-learn is a popular and standard one. For k-means, given the data and the number of clusters, the library does the rest. For DB-Scan, given the data, the distance, and the minimum number of users, the library does the rest.
      # k-means
      from sklearn.cluster import KMeans
      import numpy as np
      data = np.array([[1, 2], [1, 4], …])
      clusters = KMeans(n_clusters=2).fit(data)
      print(clusters.labels_)              # e.g. [1, 1, 1, 0, 0, …]
      print(clusters.predict([[12, 3]]))   # cluster index of a new point, e.g. [0]

      # DB-Scan
      from sklearn.cluster import DBSCAN
      import numpy as np
      data = np.array([[1, 2], [2, 2], …])
      clusters = DBSCAN(eps=3, min_samples=2).fit(data)
      print(clusters.labels_)              # e.g. [0, 0, 0, 1, 1, -1, …]; -1 marks noise
      # DBSCAN has no predict(); fit_predict re-runs the clustering on the data it is given.
      predictions = clusters.fit_predict(new_data)
  • 52. Frequent Itemset Mining. From a human perspective, the groups produced by all previous grouping approaches are hard to interpret. As explainability matters in AI (the XAI trend), it is desirable to have a soft, non-radial grouping method that represents groups in a human-understandable form, e.g., "group of students who participate in the RAIS summer school." Frequent Itemset Mining (FIM) is often considered a method for market basket analysis. The initial goal is to find sets of products that are frequently bought together. Each frequent itemset is a describable group. (Cartoon © S. Harris @ ScienceCartoonsPlus.com)
  • 53. FIM: Definitions. We are given a set of items $\mathcal{I}$, where any subset of $\mathcal{I}$ is an itemset. We are also given a transaction (un-tidy) dataset $\mathcal{T}$, where each member of $\mathcal{T}$ is an itemset. Given an itemset $X \subseteq \mathcal{I}$, $support(X)$ is the number of transactions containing $X$. An itemset $X \subseteq \mathcal{I}$ is a frequent itemset if $support(X) \geq \delta$, where $\delta$ is the minimum support threshold. Given two itemsets $X \subseteq \mathcal{I}$ and $Y \subseteq \mathcal{I}$, an association rule $X \rightarrow Y$ with confidence $c$ holds if $c \geq \delta'$ ($\delta'$ is the minimum confidence threshold), where $c = support(X \cup Y) / support(X)$.
  • 54. FIM: Example. Transaction user dataset: the movies watched by users u1, u2, u3, u4, u5, and u6 (shown as posters on the slide; the same six users are tabulated on slide 57). {The Terminal, Forrest Gump, The Pianist} is a frequent itemset: absolute support = 4, relative support = 4/6 ≈ 67%. {Forrest Gump, The Pianist} → {The Terminal} is an association rule: confidence = 4/5 = 80%. {The Pianist} → {The Terminal, Forrest Gump} is another association rule: confidence = 4/6 ≈ 67%.
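The definitions of support and confidence can be checked with a few lines of code. The sketch assumes the six-user movie dataset that the course tabulates for its LCM example (users u1 to u6); the function names are hypothetical.

```python
# One set of watched movies per user (u1 to u6).
transactions = [
    {"The Terminal", "Forrest Gump", "The Pianist", "Psycho", "Unhinged"},  # u1
    {"The Terminal", "Forrest Gump", "The Pianist", "Unhinged"},            # u2
    {"The Pianist"},                                                        # u3
    {"Forrest Gump", "The Pianist"},                                        # u4
    {"The Terminal", "Forrest Gump", "The Pianist", "Psycho"},              # u5
    {"The Terminal", "Forrest Gump", "The Pianist"},                        # u6
]

def support(itemset):
    """Absolute support: number of transactions containing the whole itemset."""
    return sum(set(itemset) <= t for t in transactions)

def confidence(x, y):
    """Confidence of the rule X -> Y: support(X u Y) / support(X)."""
    return support(set(x) | set(y)) / support(x)

trio = {"The Terminal", "Forrest Gump", "The Pianist"}
print(support(trio), support(trio) / len(transactions))
print(confidence({"Forrest Gump", "The Pianist"}, {"The Terminal"}))
print(confidence({"The Pianist"}, {"The Terminal", "Forrest Gump"}))
```

Since support can only shrink when items are added to the left-hand side's complement, a rule with a smaller antecedent (here {The Pianist}) can never have a higher confidence than a rule over the same items with a larger antecedent.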
  • 55. FIM: Computation. Apriori algorithm [Agrawal et al., SIGMOD'93]: a level-wise search (first 1-itemsets, then 2-itemsets, …) that exploits the following pruning opportunity: if an itemset is not frequent, then none of its supersets is frequent. For instance, if {Psycho, Unhinged} is not frequent, then of course {Psycho, Unhinged, The Pianist} won't be frequent either. For instance, given the minimum support threshold equal to 2, the itemset {young, CA, student} is not frequent, and neither is its superset. Figure: itemset lattice with the number of records per group [Omidvar-Tehrani, Amer-Yahia, Dutot, Trystram @ PKDD'16]: {} 3662; {male} 2634; {young} 2147; {CA} 664; {student} 184; {male, young} 1588; {male, CA} 477; {young, CA} 375; {male, student} 120; {CA, student} 20; {young, student} 13; {male, young, CA} 268; {male, CA, student} 17; {male, young, student} 13; {young, CA, student} 2; {male, young, CA, student} 2.
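The level-wise search and its pruning rule can be sketched as follows. This is a teaching-sized version, not an optimized miner: it recounts support by scanning all transactions, the function name is hypothetical, and the sample data is assumed to be the six-user movie dataset used elsewhere in the course (titles abbreviated).

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(itemset <= t for t in transactions)

    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    level = {frozenset([i]): support(frozenset([i])) for i in items}
    level = {s: c for s, c in level.items() if c >= min_support}

    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        # Join: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: a candidate with an infrequent (k-1)-subset cannot be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c: support(c) for c in candidates}
        level = {c: n for c, n in level.items() if n >= min_support}
    return frequent

movies = [
    {"Terminal", "Forrest", "Pianist", "Psycho", "Unhinged"},
    {"Terminal", "Forrest", "Pianist", "Unhinged"},
    {"Pianist"},
    {"Forrest", "Pianist"},
    {"Terminal", "Forrest", "Pianist", "Psycho"},
    {"Terminal", "Forrest", "Pianist"},
]
frequent = apriori(movies, min_support=4)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a minimum support of 4, Psycho and Unhinged are discarded at level 1, so no candidate containing them is ever generated or counted; this is the pruning opportunity the slide describes.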
  • 56. FIM for mining describable groups of users
    • We employ LCM, an efficient frequent itemset mining algorithm (a faster alternative to Apriori), for mining groups in user data. [Uno et al., Discovery Science ’04] http://research.nii.ac.jp/~uno/code/lcm.html
    • Step 1. Identifiers for both users and items should be mapped to a non-negative integer space (required by LCM). For instance, if the movie Titanic (as an item) is mapped to “25” and the user “John” is mapped to “120”, the tuple <120,25> means that John has watched the movie Titanic.
    • Step 2. We transform the tidy dataset into an un-tidy (transactional) dataset, where each line represents one user and all item IDs associated with that user are listed on the line, separated by spaces.
    • Step 3. Run LCM to mine groups.
    • Each line in the output file returned by LCM represents one group.
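Step 2 can be sketched as follows, assuming the tidy dataset is already a list of integer `(user_id, item_id)` pairs; the function names and the sample pairs are hypothetical.

```python
from collections import defaultdict

def to_transactions(pairs):
    """Group (user_id, item_id) pairs into one sorted item list per user."""
    by_user = defaultdict(set)
    for user_id, item_id in pairs:
        by_user[user_id].add(item_id)
    return {u: sorted(items) for u, items in sorted(by_user.items())}

def to_lines(transactions):
    """One line per user, item IDs separated by spaces."""
    return [" ".join(str(i) for i in items) for items in transactions.values()]

tidy = [(120, 25), (120, 7), (121, 25), (121, 3), (121, 7)]
lines = to_lines(to_transactions(tidy))
```

The user ID is not written on the line itself, so the order of users has to be kept in order to translate line numbers back to users afterwards.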
  • 57. Full-fledged behaviors in user data mining
    • With the approach discussed in the previous slides, we obtain groups based solely on the co-occurrence of items.
    • It is more desirable to mix demographics and items to obtain groups such as “middle-aged females in Grenoble who watched The Terminal and Forrest Gump.”
    • It is possible to encode user attributes in the same transactional database. Then LCM will give us full-fledged groups.

    user  gender  age     movies watched
    u1    F       Young   Terminal, Forrest., Pianist, Psycho, Unhinged
    u2    F       Middle  Terminal, Forrest., Pianist, Unhinged
    u3    M       Middle  Pianist
    u4    F       Young   Forrest., Pianist
    u5    F       Middle  Terminal, Forrest., Pianist, Psycho
    u6    M       Middle  Terminal, Forrest., Pianist

    movie     code        attribute value  code
    Terminal  1           Female           101
    Forrest.  2           Male             102
    Pianist   3           Young            103
    Psycho    4           Middle           104
    Unhinged  5

    Un-tidy dataset:
    line #  transaction
    1       1 2 3 4 5 101 103
    2       1 2 3 5 101 104
    3       3 102 104
    4       2 3 101 103
    5       1 2 3 4 101 104
    6       1 2 3 102 104

    LCM output:  [1 2 3 101 104] (2) [2 5]
    Translated:  [Terminal Forrest. Pianist Female Middle] (2) [u2 u5]
  • 58. Hands-on 3: Mining user groups
    • Step 1. Find the MovieLens 1M dataset on a dataset repository and download it. The dataset contains movies that users appreciated watching. We only need the file ratings.dat.
    • Step 2. Download the Python file pmr.py from the following link and complete it: https://drive.google.com/drive/folders/1xMxGdcI2IGgTAhozDUqSfZAWzKVXfkjr?usp=sharing.
    • Step 3. Run the code to obtain the output file pmr.txt.
    • Step 4. Download the LCM software from the following link: https://drive.google.com/drive/folders/1xMxGdcI2IGgTAhozDUqSfZAWzKVXfkjr?usp=sharing.
    • Step 5. Put the dataset file in the same folder as LCM.
    • Step 6. Run LCM as follows: ./lcm CfI -l 5 -u 100 pmr.txt 3 out.txt
    • Step 7. Open the output file out.txt. Each line in out.txt represents a group in the following structure: [set of items] (support) [set of users]. The description of the group is [set of items]. The set of group members is [set of users].
    • Step 8. Try to find 5 interesting user groups.
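To browse the mined groups in Step 7, each output line can be parsed into its three parts. The sketch below assumes the line structure described on the slide, `[set of items] (support) [set of users]`; the actual output of a given LCM build may differ, and `parse_group` is a hypothetical helper.

```python
import re

# [items] (support) [users], as described in Step 7
LINE = re.compile(r"\[([^\]]*)\]\s*\((\d+)\)\s*\[([^\]]*)\]")

def parse_group(line):
    """Parse one output line into items, support, and users; None if malformed."""
    match = LINE.search(line)
    if match is None:
        return None
    return {
        "items": [int(x) for x in match.group(1).split()],
        "support": int(match.group(2)),
        "users": [int(x) for x in match.group(3).split()],
    }

group = parse_group("[1 2 3 101 104] (2) [2 5]")
```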
  • 59. Poll: Challenge of mining user groups
    • Question. Following the steps in the previous hands-on, what is the most challenging aspect of mining groups that remains unsolved?
    • Figure: bar chart of votes (0 to 4) per challenge of user data mining: Efficiency, Overlap, Size of clusters, Explainability, Mechanism, Binning.
  • 60. User data mining for advanced decision making
    • Both clustering and frequent itemset mining are based on the idea of density maximization.
    • But is density what the end-user really desires to achieve?
    • Oftentimes, more quality measures are required, such as coverage, diversity, and variance. © prototypr.io
  • 61. Multi-objective optimization
    • This makes a multi-objective optimization problem.
    • Given a set of ratings $R$, identify all group-sets $G$ where each group-set satisfies:
      • $coverage(G, R)$ is maximized, ensuring that most input records belong to at least one group in the output;
      • $diversity(G, R)$ is maximized, ensuring that found groups are as different as possible from each other;
      • $diameter(G, R)$ is minimized, ensuring that ratings within each group are homogeneous.
    • The problem is proved to be NP-Complete by a reduction from the Exact 3-Set Cover problem (EC3). [Omidvar-Tehrani, Amer-Yahia, Dutot, Trystram @ PKDD’16]
  • 62. Diameter objective
    • Diameter is a simple but effective measure of variance in ratings.
    • We observe that most reviewers agree on a high score for the movie The Godfather → minimum diameter.
    • We also observe that the reviewers are divided when voting on Fifty Shades of Grey → maximum diameter.
    • Figure: rating distribution of The Godfather (1972) in IMDb (count in % per rating score, 1 to 10): homogeneous rating distribution, minimum diameter.
    • Figure: rating distribution of Fifty Shades of Grey (2015) in IMDb (count in % per rating score, 1 to 10): polarized rating distribution, maximum diameter.
    • Other rating distributions exist, such as increasing, decreasing, heterogeneous, etc.
  • 63. Pareto group discovery
    • A bottom-up exhaustive approach to discover the Pareto front.
    • Generating fewer plans makes a multi-objective optimization algorithm run faster.
    • Figure: user groups as Pareto fronts, plotted on coverage (0 to 1) against diversity (0 to 1). Legend: candidate group-set, dominance area, rejected group-set, Pareto group-set, α-dominance area, group-set rejected in case of α-dominance.
  • 64. An approximation algorithm for Pareto group discovery
    1. Inputs are $R$, $k$, $\alpha > 1$
    2. Output is the Pareto result set $\mathcal{P}$
    3. $\mathcal{P} \leftarrow \emptyset$
    4. For all user groups $g$ do
       1. $G$ ← singleton group-set containing $g$
       2. If $G$ is not $\alpha$-dominated by any other group-set $\in \mathcal{P}$, then add $G$ to $\mathcal{P}$
    5. For $n \in [2, k]$ do
       1. For each possible group-set $G$ of size $n$ do
          1. If $G$ is not $\alpha$-dominated by any other group-set $\in \mathcal{P}$, then add $G$ to $\mathcal{P}$
    6. Return $\mathcal{P}$
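The loop structure above can be sketched as follows. The slide does not spell out α-dominance, so this sketch assumes a common definition for maximized objectives: a vector a α-dominates b if α·a_i ≥ b_i for every objective i. The `objectives` callback and the toy scores are hypothetical placeholders for coverage, diversity, and (negated) diameter.

```python
from itertools import combinations

def alpha_dominates(a, b, alpha):
    """Assumed definition: a alpha-dominates b if alpha * a_i >= b_i for all i."""
    return all(alpha * x >= y for x, y in zip(a, b))

def approx_pareto(groups, objectives, k, alpha):
    """Enumerate group-sets of size 1..k; keep those not alpha-dominated so far."""
    pareto = []  # list of (group_set, objective_vector)
    for n in range(1, k + 1):
        for group_set in combinations(groups, n):
            score = objectives(group_set)
            if not any(alpha_dominates(p, score, alpha) for _, p in pareto):
                pareto.append((group_set, score))
    return pareto

# Toy objective vectors (coverage, diversity), both to be maximized.
toy_scores = {("a",): (0.5, 0.5), ("b",): (0.6, 0.6), ("a", "b"): (1.0, 0.2)}
result = approx_pareto(["a", "b"], toy_scores.get, k=2, alpha=1.5)
```

A larger α rejects more group-sets, which shrinks the result set and speeds up the search at the price of a coarser approximation of the Pareto front.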
  • 65. Hands-on 4: Mining multi-objective user groups
    • Step 0. We continue the previous hands-on, so we need the mined groups.
    • Step 1. Download and unzip the file MOMRI.zip from https://drive.google.com/drive/folders/1M-HlNao9tYwqN0imeZ-SzHnGZKMoJgh4?usp=sharing. It is a Java NetBeans project whose main package is “MOQO.MRI” and whose main executable is MOMRI.java.
    • Step 2. Run the algorithm. The output of the algorithm reports the progress in finding Pareto plans.
    • Step 3. Add a new objective to the optimizer.
    • Download the documentation at https://drive.google.com/file/d/1BE1jL2Lp327_Lxb1MMudY2p6l1tG_Uj4/view?usp=sharing.
    • Documentation excerpt (screenshot, truncated on the slide; sections: Executable file, Parameters, Output): the parameter “ds” (line 21 of MOMRI.java) selects the input data. MovieLens 1M (ds=“ml1m”) is the default dataset; MovieLens 100K (ds=“ml100k”) is also available. The method “read ratings()” in line 30 of MOMRI.java reads ratings from the data file on disk.
  • 66. Recommendation systems
    • Recommendation systems are designed to automatically find relevant and desirable items to be consumed by users in the future.
    • In general, those systems work by predicting the items that are likely to be the most appealing to users based on their preferences.
    • Intuitively, the problem of recommendation reduces to filling missing values in the user-item interaction matrix. [Amer-Yahia and Benouaret, BigData’20]

    Multi-scale rating user-item interaction matrix ("-" marks a missing rating):
          Terminal  Forrest.  Pianist  Psycho  Unhinged
    u1    5         4         5        4       3
    u2    4         5         5        -       -
    u3    -         -         4        -       -
    u4    -         3         3        -       -
    u5    3         2         3        2       -
    u6    3         4         2        -       -

    • Question. How would u2 rate the movie Psycho in the future? Answer. Probably like other users similar to u2, such as u1 or u5.
    • Question. Is u2 more similar to u1 or u5? Answer. Following their ratings for The Terminal, Forrest Gump, and The Pianist, u2 is more similar to u1. Hence u2 would probably rate Psycho around 4, as u1 did.
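The reasoning in the two answers can be reproduced with a nearest-neighbor lookup. The slide does not name a similarity measure for this example, so the sketch assumes Euclidean distance over the movies both users rated; the variable names are illustrative.

```python
import math

ratings = {
    "u1": {"Terminal": 5, "Forrest": 4, "Pianist": 5, "Psycho": 4, "Unhinged": 3},
    "u2": {"Terminal": 4, "Forrest": 5, "Pianist": 5},
    "u5": {"Terminal": 3, "Forrest": 2, "Pianist": 3, "Psycho": 2},
}

def distance(a, b):
    """Euclidean distance over the movies rated by both users (an assumed measure)."""
    common = ratings[a].keys() & ratings[b].keys()
    return math.sqrt(sum((ratings[a][m] - ratings[b][m]) ** 2 for m in common))

# Candidate neighbors are the users who already rated Psycho.
neighbors = [u for u in ratings if u != "u2" and "Psycho" in ratings[u]]
nearest = min(neighbors, key=lambda u: distance("u2", u))
predicted = ratings[nearest]["Psycho"]
```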
  • 67. Types of recommendation systems
    • Rule-based approaches used to be the dominant method for recommendation. They are still used in industry.
    • The most common state-of-the-art approaches are content-based filtering and collaborative filtering.
    • Content-based filtering recommends items based on the ones that the user liked before.
    • Collaborative filtering recommends items which are popular among the neighbors of the user.
    • CONTENT-BASED. Nina likes 60 drama movies, 20 romance, and 20 action. Similarity to her profile: La Vie en Rose (Biography, Drama) 60%, Me before You (Drama, Romance) 80%, Memento (Mystery, Thriller) 0%. “Me before You” will be ranked higher than “La Vie en Rose” in Nina’s content-based recommendation.
    • COLLABORATIVE. Nina’s taste overlaps with Stephanie’s and Charles’s. Stephanie has the same taste as Nina (more impact) and likes “La Vie en Rose” more than “Me before You”. Charles’s taste is somewhat different from Nina’s (less impact), and he likes “Memento” more than “Me before You”. “La Vie en Rose” will be ranked higher than “Memento” in Nina’s collaborative recommendation.
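The content-based percentages can be reproduced by summing the profile weights of a movie's genres. This scoring rule is an assumption that happens to match the slide's numbers; it is not necessarily the measure used by the author.

```python
# Nina's genre profile: 60 drama, 20 romance, 20 action out of 100 liked movies.
profile = {"Drama": 0.6, "Romance": 0.2, "Action": 0.2}

movies = {
    "La Vie en Rose": ["Biography", "Drama"],
    "Me before You": ["Drama", "Romance"],
    "Memento": ["Mystery", "Thriller"],
}

def similarity(genres):
    """Assumed score: total profile weight of the movie's genres."""
    return sum(profile.get(g, 0.0) for g in genres)

ranking = sorted(movies, key=lambda m: similarity(movies[m]), reverse=True)
```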
  • 68. Collaborative filtering
    • As collaborative filtering (CF) captures “like-minded behaviors”, it is often a favorite recommendation option.
    • Two methods are proposed for implementing a CF approach: memory-based and model-based.
    • In a memory-based implementation, the entire user-item interaction matrix is employed.
    • In a model-based implementation, a model of users is developed to learn their preferences.
    • Figure: spectrum between memory-based and model-based approaches, with one direction labeled "towards more simplicity" and the other "towards more efficiency".
  • 69. Similarity between users
    • An important step in recommendation is to compare all users to the input user and find the one that is most similar.
    • This is done using Pearson correlation.
    • To measure the similarity between the tastes of Sara and Anderson, let’s assume x is the taste vector of Sara and y is Anderson’s, both rating n movies. The Pearson correlation is $r = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \, / \, \big( \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2} \big)$.
    • The value r lies in the range -1 to +1, where +1 means that Sara and Anderson have perfectly similar tastes, and -1 means the opposite.
    • In practice, this correlation cannot be computed for every single user, hence we often use a small sample.
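A direct, dependency-free implementation of Pearson correlation is sketched below. Returning 0 for a constant rating vector is a convention chosen for this example, since the coefficient is undefined in that case.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long rating vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # undefined for constant vectors; 0 is an assumed convention
    return cov / (norm_x * norm_y)

same_taste = pearson([1, 2, 3], [2, 4, 6])      # perfectly aligned tastes
opposite_taste = pearson([1, 2, 3], [3, 2, 1])  # perfectly opposed tastes
```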
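  • 70. Memory-based CF: user-based
    • Common implementations are user-based and item-based. We practice the former.

```python
import pandas as pd
import numpy as np

# read_data and pearson_correlation are helpers of the hands-on file (elided on the slide)
movies_df, ratings_df = read_data(…)

# movies liked by the input user, to be filled with columns "movie_id" and "rating"
user_preferences = pd.DataFrame()

# ratings of other users on the movies that the input user has rated
user_subset = ratings_df[ratings_df["movie_id"].isin(user_preferences["movie_id"].tolist())]
user_subset_group = user_subset.groupby("user_id")
# users sharing the most movies with the input user come first; keep the top 100
user_subset_group = sorted(user_subset_group, key=lambda x: len(x[1]), reverse=True)[:100]

# similarity between the input user and each candidate neighbor
pearson_correlation_dict = {}
for name, group in user_subset_group:
    pearson_correlation_dict[name] = pearson_correlation(user_preferences, group)

# the 50 most similar users, joined with all of their ratings
sim_df = pd.DataFrame(list(pearson_correlation_dict.items()), columns=["user_id", "sim"])
top_users = sim_df.sort_values("sim", ascending=False).head(50)
top_users_rating = top_users.merge(ratings_df, on="user_id")

# similarity-weighted average rating per movie
top_users_rating["weighted_rating"] = top_users_rating["sim"] * top_users_rating["rating"]
recommendation_df = top_users_rating.groupby("movie_id")[["sim", "weighted_rating"]].sum()
recommendation_df["score"] = recommendation_df["weighted_rating"] / recommendation_df["sim"]
recommendation_df = recommendation_df.sort_values("score", ascending=False)

# top-10 movies (the column name "movie_id" in movies_df is assumed)
final_rec = movies_df.loc[movies_df["movie_id"].isin(recommendation_df.head(10).index)]
```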
  • 71. Hands-on 5: Memory-based collaborative filtering
    • Task. We are given a list of liked movies. Provide top-10 recommendations.
    • Download the Python file cf.py from the following link and complete it: https://drive.google.com/drive/folders/1M-HlNao9tYwqN0imeZ-SzHnGZKMoJgh4?usp=sharing.
    • Missing parts are marked with a TODO comment. © streamingclarity.com
  • 72. Model-based CF: matrix factorization
    • CF as a “neighborhood” method, focusing on maximizing “closeness”, does not handle scalability issues and noise.
    • CF operates on low-level (raw) data, which does not capture well the similarities between users at higher levels.
    • Matrix factorization is a solution for both aforementioned issues.
    • Factorization is a simple but fundamental operation in mathematics, e.g., representing “12” with its factors “4” and “3”.
    • In the context of recommendation, it is the task of factorizing the user-item interaction matrix into two matrices corresponding to users and items.
  • 73. Singular Value Decomposition for Matrix Factorization
    • Among different ways of factorizing matrices, Singular Value Decomposition (SVD) is of particular interest in the recommendation domain.
    • SVD decomposes an interaction matrix $R$ into factors that, once truncated, give the “best” lower-rank approximation of $R$.
    • The main SVD equation is $R = Q \Sigma P^T$, where $\Sigma$ is the diagonal matrix of singular values (weights). © CodingFox
  • 74. Model-based CF with SVD
    • To get the lower-rank approximation, we employ SVD and keep the top k latent features, which capture the most important underlying tastes.
    • For illustration purposes, we consider k = 2, but k ≈ 50 is more typical.

```python
import pandas as pd
import numpy as np
from scipy.sparse.linalg import svds

# step 1: get_data is a helper elided on the slide
ratings_df, users_df, movies_df = get_data(…)

# step 2: user-item matrix (column names assumed; missing ratings filled with 0), then truncated SVD
ratings_pivot_df = ratings_df.pivot(index="user_id", columns="movie_id", values="rating").fillna(0)
U, sigma, Vt = svds(ratings_pivot_df.to_numpy(dtype=float), k=2)
sigma = np.diag(sigma)

# step 3: reconstruct the full prediction matrix
predictions = np.dot(np.dot(U, sigma), Vt)
```

    Step 1 (original dataset):
          Terminal  Forrest.  Pianist
    u1    4.5       3         ??
    u2    5         5         2

    Step 2:
    Matrix U        f1    f2         Matrix V    f1    f2
    u1              1.1   2.3        Terminal    1.9   1
    u2              2.1   1          Forrest.    2.3   0
                                     Pianist     0     2

    Matrix Vt   Terminal  Forrest.  Pianist
    f1          1.9       2.3       0
    f2          1         0         2

    Step 3 (reconstructed dataset):
          Terminal  Forrest.  Pianist
    u1    4.39      2.53      4.6
    u2    4.99      4.83      2
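The reconstructed dataset of Step 3 can be checked by multiplying the illustrative factor matrices U and Vt from Step 2. This small sketch uses plain Python lists and assumes, as the slide's numbers suggest, that the singular values are already folded into the factors.

```python
# Illustrative factors from the slide (users x features, features x movies).
U = [[1.1, 2.3],
     [2.1, 1.0]]
Vt = [[1.9, 2.3, 0.0],
      [1.0, 0.0, 2.0]]

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Rows: u1, u2. Columns: Terminal, Forrest., Pianist.
predictions = [[round(v, 2) for v in row] for row in matmul(U, Vt)]
```

The entry for u1 and Pianist, unknown in the original dataset, is filled in by the product of u1's features with Pianist's features.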
  • 75. Deep learning for recommendation
    • So far, we covered neighborhood and matrix factorization methods for recommendation.
    • For more efficiency and precision, we also look at deep approaches, i.e., the active trend in recommendation.
    • Deep learning is data-hungry, hence we often use implicit-feedback data rather than explicit-feedback data.

    Explicit-feedback interaction matrix ("-" marks a missing rating):
          Terminal  Forrest.  Pianist  Psycho  Unhinged
    u1    5         4         5        4       3
    u2    4         5         5        -       -
    u3    -         -         4        -       -
    u4    -         3         3        -       -
    u5    3         2         3        2       -
    u6    3         4         2        -       -

    Implicit-feedback interaction matrix:
          Terminal  Forrest.  Pianist  Psycho  Unhinged
    u1    1         1         1        1       1
    u2    1         1         1        0       0
    u3    0         0         1        0       0
    u4    0         1         1        0       0
    u5    1         1         1        1       0
    u6    1         1         1        0       0
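Turning explicit feedback into implicit feedback amounts to replacing every observed rating by 1 and every missing one by 0. A minimal sketch, with `None` standing for a missing rating (a representation chosen for this example):

```python
movies = ["Terminal", "Forrest.", "Pianist", "Psycho", "Unhinged"]

# Explicit ratings per user, in the column order of `movies`.
explicit = {
    "u1": [5, 4, 5, 4, 3],
    "u2": [4, 5, 5, None, None],
    "u3": [None, None, 4, None, None],
    "u4": [None, 3, 3, None, None],
    "u5": [3, 2, 3, 2, None],
    "u6": [3, 4, 2, None, None],
}

# 1 if the user interacted with the movie, 0 otherwise.
implicit = {user: [0 if r is None else 1 for r in row]
            for user, row in explicit.items()}
```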
  • 78. Neural Collaborative Filtering (NCF)
    • We employ a simple but efficient implementation of a deep neural network for recommendation, called Neural Collaborative Filtering (NCF). [He et al. ArXiv’17]

```python
import pandas as pd
import numpy as np
import torch.nn as nn
import pytorch_lightning as pl  # assumed: the trainer/fit calls on the slide match PyTorch Lightning

ratings = read_data()
# make the algorithm scalable
ratings = filter_to(ratings, 0.1)
train_ratings, test_ratings = split_train_test(ratings)
# mark all seen data as "1" and pick a few negative examples
users, items, labels = make_implicit_data(train_ratings)
model = NCF(num_users, num_items, train_ratings, movies)
trainer = pl.Trainer(max_epochs=5)
trainer.fit(model)
```

    • Detail of filter_to: keep the ratings of a random 10% of the users.

```python
# step 1
random_users = np.random.choice(ratings['user_id'].unique(),
                                size=int(len(ratings['user_id'].unique()) * 0.1),
                                replace=False)
# step 2
ratings = ratings.loc[ratings['user_id'].isin(random_users)]
```

    • Detail of split_train_test: the latest rating of each user goes to the test set.

```python
# step 1
ratings['rank_latest'] = ratings.groupby(['user_id'])['timestamp'].rank(method='first', ascending=False)
# step 2
train_ratings = ratings[ratings['rank_latest'] != 1]
test_ratings = ratings[ratings['rank_latest'] == 1]
```
  • 79. NCF architecture
    • Akin to the notion of latent factors in MF, the input to the network is user and item embeddings. © James Loy @ Kaggle

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl  # assumed: the hooks below match a LightningModule
from torch.utils.data import DataLoader

class NCF(pl.LightningModule):
    def __init__(self, num_users, num_items, ratings, all_movie_ids):
        …  # embeddings and layers are elided on the slide

    def forward(self, user_input, item_input):
        user_embedded = self.user_embedding(user_input)
        item_embedded = self.item_embedding(item_input)
        # concatenate both embeddings and pass them through two dense layers
        vector = torch.cat([user_embedded, item_embedded], dim=-1)
        vector = nn.ReLU()(self.fc1(vector))
        vector = nn.ReLU()(self.fc2(vector))
        pred = nn.Sigmoid()(self.output(vector))
        return pred

    def training_step(self, batch):
        user_input, item_input, labels = batch
        predicted_labels = self(user_input, item_input)
        # binary cross-entropy needs both the predictions and the target labels
        loss = nn.BCELoss()(predicted_labels, labels.view(-1, 1).float())
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

    def train_dataloader(self):
        return DataLoader(ratings, batch_size=512, num_workers=4)
```
  • 80. Group recommendation
    • The outcome of a typical recommendation engine is a personalized top-k recommendation list.
    • What if a group of users wants to receive recommendations that they all appreciate collectively?
    • A naïve approach towards group recommendation is the creation of a virtual user.

    Predictions    rating(“Me before You”)  rating(“Memento”)
    Olivia         1                        3
    Julia          1                        1
    Jacob          5                        3

    • Question. Which movie should the group watch together? Answer. Consider them as a virtual user with average rating.
    • Question. The average for both movies will become 2.33!! Alternative? Answer. Consider them as a virtual user with least misery.
    • Question. The least misery score for both is 1!! Alternative? Answer. …!
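The two virtual-user strategies can be computed directly from the predictions of the three group members. A minimal sketch (function names are illustrative):

```python
predictions = {
    "Olivia": {"Me before You": 1, "Memento": 3},
    "Julia": {"Me before You": 1, "Memento": 1},
    "Jacob": {"Me before You": 5, "Memento": 3},
}

def average(movie):
    """Virtual user with average rating."""
    scores = [p[movie] for p in predictions.values()]
    return sum(scores) / len(scores)

def least_misery(movie):
    """Virtual user with least misery: the lowest predicted rating in the group."""
    return min(p[movie] for p in predictions.values())

avg = {m: average(m) for m in ("Me before You", "Memento")}
misery = {m: least_misery(m) for m in ("Me before You", "Memento")}
```

Both strategies yield a tie between the two movies, which is exactly the difficulty raised by the questions above.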
  • 81. Solving the group recommendation problem
    • Problem. Given user group $G$, return the $k$ best items to recommend (denoted as $I_G$) to $G$ during period $p$ such that:
      • $I_G$ contains $k$ items.
      • Every item in $I_G$ is new to all members of $G$.
      • There does not exist any other item whose score is higher than that of any item in $I_G$.
    • Solution. A top-k processing algorithm is proposed. [Basu Roy et al., VLDBJ’10 and ICDE’14]
      • We materialize lists such as static affinity, absolute preference, and dynamic affinity, and then scan all lists in round-robin fashion (like NRA), followed by a buffer update.
      • We terminate using a stopping condition.
  • 82. Top-k processing
    • Top-k processing is a family of algorithms with the aim of finding the k items that best answer a user’s query.
    • The performance of a top-k processing algorithm is measured in terms of the number of sequential accesses (SAs) and random accesses (RAs) it makes.
    • For instance, you access your third favorite song on an audio tape using SAs, and on Spotify (or essentially on a hard drive) using an RA.
    • The naïve computation of top-k is to compute the score of each item, sort the items in decreasing order, and return the top k. When we have billions of items, this approach is infeasible.
    • An alternative idea is to throw space at the problem, by pre-computing inverted lists and scanning them with a stopping condition.
    • Famous algorithms in this genre are TA and NRA. We review the latter here.
  • 83. No-Random-Access (NRA) algorithm
    • Access all lists sequentially and in parallel.
    • After each cursor move, compute:
      • the worst-case score $W(r)$ and the best-case score $B(r)$ for each seen $r$ ($r$ is an item, e.g., a movie or a book);
      • sort all seen items on $W(r)$, breaking ties by $B(r)$;
      • if $W(r) > min_k$ then
        • add $r$ to the buffer;
        • $min_k = \min(W(r')) \; \forall r' \in$ buffer;
      • else if $B(r) > min_k$
        • add $r$ to the candidates.
    • Stop if $B(d') \leq min_k \; \forall d' \in$ candidates.
    • Return the top-$k$ items.
    • Figure: three inverted lists scanned with sequential accesses (SA), as opposed to random accesses (RA). Predictions for Julia: Titanic 1, Terminal 0.2, … Predictions for Jacob: God Father 3.3, Titanic 1.4, … Predictions for Olivia: Titanic 2.3, God Father 0.1, …
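The steps above can be sketched as a compact Python function. This is a simplified variant under stated assumptions: scores are aggregated by sum, every list contains every item, cursors advance in rounds (one sequential access per list), and the stopping test additionally bounds items that have not been seen at all. The input lists are Jacob's and Julia's inverted lists from the NRA example slides.

```python
def nra(lists, k):
    """Return (top-k items, depth reached) using sequential accesses only."""
    m = len(lists)
    seen = {}  # item -> {list index: score}
    ranked, depth = [], 0
    while depth < max(len(lst) for lst in lists):
        for i, lst in enumerate(lists):  # one sequential access per list
            if depth < len(lst):
                item, score = lst[depth]
                seen.setdefault(item, {})[i] = score
        depth += 1
        # The last score read in a list bounds every score not yet read there.
        last = [lst[min(depth, len(lst)) - 1][1] for lst in lists]
        worst = {r: sum(s.values()) for r, s in seen.items()}
        best = {r: sum(s.get(i, last[i]) for i in range(m)) for r, s in seen.items()}
        ranked = sorted(seen, key=lambda r: (worst[r], best[r]), reverse=True)
        buffer, candidates = ranked[:k], ranked[k:]
        if len(buffer) == k:
            min_k = min(worst[r] for r in buffer)
            # Stop when no candidate and no unseen item can still beat min_k.
            if all(best[r] <= min_k for r in candidates) and sum(last) <= min_k:
                return buffer, depth
    return ranked[:k], depth

jacob = [("r1", 1.5), ("r2", 1.5), ("r3", 1.25), ("r4", 1), ("r5", 1),
         ("r6", 0.8), ("r7", 0.75), ("r8", 0.75), ("r9", 0.5), ("r10", 0.5)]
julia = [("r7", 5), ("r3", 4.5), ("r4", 4.5), ("r2", 4.25), ("r1", 3.5),
         ("r9", 3), ("r5", 2), ("r6", 1), ("r8", 0.5), ("r10", 0.5)]
top, depth = nra([jacob, julia], k=3)
```

On these lists the scan stops before reaching the end of either list, which is the whole point of maintaining worst-case and best-case bounds.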
  • 84. NRA example (step 1)
    • We initialize cursors at the head of each list. We assume k = 3 (hence the buffer size), and we have space to keep track of 10 candidates.
    • 0 SAs have been performed so far. mink = ?

    Jacob’s inverted list          Julia’s inverted list
    movie  predicted rating        movie  predicted rating
    r1     1.5                     r7     5
    r2     1.5                     r3     4.5
    r3     1.25                    r4     4.5
    r4     1                       r2     4.25
    r5     1                       r1     3.5
    r6     0.8                     r9     3
    r7     0.75                    r5     2
    r8     0.75                    r6     1
    r9     0.5                     r8     0.5
    r10    0.5                     r10    0.5

    • The buffer (positions 1 to 3) and the candidates (positions 4 to 10) are still empty. Each entry will hold a movie with its worst-case score and best-case score.
  • 88. NRA example (step 2)
    • We move the cursors sequentially over Jacob’s and Julia’s inverted lists (as in step 1).
    • We complete the buffer by adding movies r7, r1, and r2 to it.
    • 3 SAs have been performed so far. mink = 1.5

    position  movie  worst-case score  best-case score
    Buffer
    1         r7     5                 6.5
    2         r1     1.5               6.5
    3         r2     1.5               6
    Candidates (positions 4 to 10): empty
  • 92. NRA example (step 3)
    • Once the buffer is complete, we check whether a new movie is worth adding to the buffer.
    • 4 SAs have been performed so far. mink = 1.5

    position  movie  worst-case score  best-case score
    Buffer
    1         r7     5                 6.5
    2         r1     1.5               6.5
    3         r2     1.5               6
    Candidates (positions 4 to 10): empty

    • Given worstcase(r3) = 4.5 and bestcase(r3) = 6, should r3 be added to the buffer? Given that worstcase(r3) > mink, YES!
  • 95. NRA example (step 4)
    • Some items will gradually transition from the buffer to the candidates (e.g., r2).
    • 5 SAs have been performed so far. mink gets updated but stays at 1.5.

    position  movie  worst-case score  best-case score
    Buffer
    1         r7     5                 6.5
    2         r3     4.5               6
    3         r1     1.5               6.5
    Candidates
    4         r2     1.5               6
    (positions 5 to 10: empty)
  • 97. NRA example (step 5)
    • We have to check the stopping condition after each SA.
    • We stop if max(bestcase(candidates)) < min_k.
    Jacob’s inverted list (movie: predicted rating): r1: 1.5, r2: 1.5, r3: 1.25, r4: 1, r5: 1, r6: 0.8, r7: 0.75, r8: 0.75, r9: 0.5, r10: 0.5
    Julia’s inverted list (movie: predicted rating): r7: 5, r3: 4.5, r4: 4.5, r2: 4.25, r1: 3.5, r9: 3, r5: 2, r6: 1, r8: 0.5, r10: 0.5
    5 SAs have been performed so far.
    position  movie  worst-case score  best-case score
    Buffer
    1         r3     5.75              5.75
    2         r7     5                 6.5
    3         r1     1.5               6.5
    Candidates
    4         r2     1.5               6
    min_k = 1.5
    ⚠ Should we stop? 6 ≮ 1.5, so NO!
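The stopping test on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `should_stop` and the dictionary layout (movie id mapped to a worst-case/best-case pair) are assumptions, not part of the course material, and the sketch ignores movies that have not been read from any list yet.

```python
def should_stop(candidates, min_k):
    """Slide condition: stop once max(bestcase(candidates)) < min_k.

    `candidates` maps a movie id to a (worst_case, best_case) pair; this
    layout is an assumption made for the sketch.
    """
    # With no candidates left, nothing seen so far can overtake the buffer.
    highest_best = max((best for _worst, best in candidates.values()),
                       default=float("-inf"))
    return highest_best < min_k


# State on the step 5 slide: r2 is the only candidate and min_k = 1.5.
print(should_stop({"r2": (1.5, 6.0)}, min_k=1.5))  # False, since 6 is not below 1.5
```

The comparison is strict, as on the slide: a candidate whose best-case score merely equals min_k cannot displace a buffer entry, so it does not prevent stopping.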
  • 101. NRA example (step 6)
    • For any new movie, we check if it should be added to the buffer. After the buffer update, we check the stopping condition.
    Jacob’s inverted list (movie: predicted rating): r1: 1.5, r2: 1.5, r3: 1.25, r4: 1, r5: 1, r6: 0.8, r7: 0.75, r8: 0.75, r9: 0.5, r10: 0.5
    Julia’s inverted list (movie: predicted rating): r7: 5, r3: 4.5, r4: 4.5, r2: 4.25, r1: 3.5, r9: 3, r5: 2, r6: 1, r8: 0.5, r10: 0.5
    6 SAs have been performed so far.
    position  movie  worst-case score  best-case score
    Buffer
    1         r3     5.75              5.75
    2         r7     5                 6.5
    3         r1     1.5               6.5
    Candidates
    4         r2     1.5               6
    min_k = 1.5
    Given worstcase(r4) = 4.5 and bestcase(r4) = 5.75, should r4 be added to the buffer?
    Given that worstcase(r4) > min_k, YES!
    ⚠ Should we stop? 6 ≮ 1.5, so NO!
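A minimal sketch of this buffer admission, assuming the buffer is already full (k = 3) and that the buffer and the candidates are plain dictionaries from movie id to a (worst-case, best-case) pair. The helper name `admit` is hypothetical.

```python
def admit(buffer, candidates, movie, worst, best):
    """Move `movie` into the full top-k buffer if worstcase(movie) > min_k."""
    min_k = min(w for w, _b in buffer.values())
    if worst <= min_k:
        return False
    # The buffer entry with the lowest worst-case score is demoted to a candidate.
    evicted = min(buffer, key=lambda m: buffer[m][0])
    candidates[evicted] = buffer.pop(evicted)
    candidates.pop(movie, None)  # the movie may have been a candidate before
    buffer[movie] = (worst, best)
    return True


# State on the step 6 slide, just before r4 is considered.
buffer = {"r3": (5.75, 5.75), "r7": (5.0, 6.5), "r1": (1.5, 6.5)}
candidates = {"r2": (1.5, 6.0)}
admitted = admit(buffer, candidates, "r4", 4.5, 5.75)
print(admitted, sorted(buffer), sorted(candidates))
```

After the call, r1 has moved to the candidates and r4 sits in the buffer, which is the state shown on the step 7 slide.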
  • 104. NRA example (step 7)
    • We update min_k after any buffer update.
    Jacob’s inverted list (movie: predicted rating): r1: 1.5, r2: 1.5, r3: 1.25, r4: 1, r5: 1, r6: 0.8, r7: 0.75, r8: 0.75, r9: 0.5, r10: 0.5
    Julia’s inverted list (movie: predicted rating): r7: 5, r3: 4.5, r4: 4.5, r2: 4.25, r1: 3.5, r9: 3, r5: 2, r6: 1, r8: 0.5, r10: 0.5
    6 SAs have been performed so far.
    position  movie  worst-case score  best-case score
    Buffer
    1         r3     5.75              5.75
    2         r7     5                 6.5
    3         r4     4.5               5.75
    Candidates
    4         r1     1.5               6.5
    5         r2     1.5               6
    min_k is updated from 1.5 to 4.5.
    ⚠ Should we stop? 6.5 ≮ 4.5, so NO!
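One way to read min_k is as the k-th highest worst-case score among all movies seen so far. The sketch below computes it with `heapq.nlargest` from the standard library; the function name `kth_worst_case` and the fallback of negative infinity when fewer than k movies have been seen are assumptions made for illustration.

```python
import heapq


def kth_worst_case(worst_scores, k):
    """Return the k-th highest worst-case score, i.e. min_k for a buffer of size k."""
    top = heapq.nlargest(k, worst_scores.values())
    # Fewer than k movies seen: no threshold yet (assumption of this sketch).
    return top[-1] if len(top) == k else float("-inf")


# Worst-case scores before and after r4 is read from Julia's list.
before = {"r3": 5.75, "r7": 5.0, "r1": 1.5, "r2": 1.5}
after = dict(before, r4=4.5)
print(kth_worst_case(before, 3), kth_worst_case(after, 3))  # 1.5 4.5
```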
  • 107. NRA example (step 8)
    • We move the cursors sequentially.
    Jacob’s inverted list (movie: predicted rating): r1: 1.5, r2: 1.5, r3: 1.25, r4: 1, r5: 1, r6: 0.8, r7: 0.75, r8: 0.75, r9: 0.5, r10: 0.5
    Julia’s inverted list (movie: predicted rating): r7: 5, r3: 4.5, r4: 4.5, r2: 4.25, r1: 3.5, r9: 3, r5: 2, r6: 1, r8: 0.5, r10: 0.5
    7 SAs have been performed so far.
    Before the update:
    position  movie  worst-case score  best-case score
    Buffer
    1         r3     5.75              5.75
    2         r7     5                 6.5
    3         r4     4.5               5.75
    Candidates
    4         r1     1.5               6.5
    5         r2     1.5               6
    After the update:
    position  movie  worst-case score  best-case score
    Buffer
    1         r3     5.75              5.75
    2         r4     5.5               5.5
    3         r7     5                 6.5
    Candidates
    4         r1     1.5               6.5
    5         r2     1.5               6
    min_k = 5
    ⚠ Should we stop? 6.5 ≮ 5, so NO!
  • 108. NRA example (step 9)
    • We move the cursors sequentially.
    Jacob’s inverted list (movie: predicted rating): r1: 1.5, r2: 1.5, r3: 1.25, r4: 1, r5: 1, r6: 0.8, r7: 0.75, r8: 0.75, r9: 0.5, r10: 0.5
    Julia’s inverted list (movie: predicted rating): r7: 5, r3: 4.5, r4: 4.5, r2: 4.25, r1: 3.5, r9: 3, r5: 2, r6: 1, r8: 0.5, r10: 0.5
    8 SAs have been performed so far.
    position  movie  worst-case score  best-case score
    Buffer
    1         r3     5.75              5.75
    2         r2     5.75              5.75
    3         r4     5.5               5.5
    Candidates
    4         r7     5                 6.5
    5         r1     1.5               6.5
    min_k = 5.5
    ⚠ Should we stop? 6.5 ≮ 5.5, so NO!
  • 115. NRA example (step 10)
    • Some movies are not worth adding to either the buffer or the candidates.
    Jacob’s inverted list (movie: predicted rating): r1: 1.5, r2: 1.5, r3: 1.25, r4: 1, r5: 1, r6: 0.8, r7: 0.75, r8: 0.75, r9: 0.5, r10: 0.5
    Julia’s inverted list (movie: predicted rating): r7: 5, r3: 4.5, r4: 4.5, r2: 4.25, r1: 3.5, r9: 3, r5: 2, r6: 1, r8: 0.5, r10: 0.5
    9 SAs have been performed so far.
    position  movie  worst-case score  best-case score
    Buffer
    1         r3     5.75              5.75
    2         r2     5.75              5.75
    3         r4     5.5               5.5
    Candidates
    4         r7     5                 6.5
    5         r1     1.5               6.5
    min_k = 5.5
    Given worstcase(r5) = 1 and bestcase(r5) = 5.25, should r5 be added to the buffer?
    Given that worstcase(r5) ≯ min_k, NO!
    Can we still keep it as a candidate?
    Given that bestcase(r5) ≯ min_k, NO!
    ⚠ Should we stop? 6.5 ≮ 5.5, so NO!
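The two questions asked on this slide form a three-way decision for every newly read movie. A minimal sketch, where the function name `classify` and the string labels are assumptions chosen for illustration:

```python
def classify(worst, best, min_k):
    """Decide where a newly read movie goes, following the two slide questions."""
    if worst > min_k:
        return "buffer"      # it already beats the weakest buffer entry
    if best > min_k:
        return "candidate"   # it could still beat it once more ratings are read
    return "discard"         # even its best case cannot reach the buffer


# r5 on this slide: worstcase = 1, bestcase = 5.25, min_k = 5.5.
print(classify(1.0, 5.25, 5.5))  # discard
```

Discarding such movies keeps the candidate list short, which matters because the stopping condition is evaluated over the candidates after every SA.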
  • 116. NRA example (step 11)
    • We move the cursors sequentially.
    Jacob’s inverted list (movie: predicted rating): r1: 1.5, r2: 1.5, r3: 1.25, r4: 1, r5: 1, r6: 0.8, r7: 0.75, r8: 0.75, r9: 0.5, r10: 0.5
    Julia’s inverted list (movie: predicted rating): r7: 5, r3: 4.5, r4: 4.5, r2: 4.25, r1: 3.5, r9: 3, r5: 2, r6: 1, r8: 0.5, r10: 0.5
    10 SAs have been performed so far.
    position  movie  worst-case score  best-case score
    Buffer
    1         r3     5.75              5.75
    2         r2     5.75              5.75
    3         r4     5.5               5.5
    Candidates
    4         r7     5                 6.5
    5         r1     5                 5
    min_k = 5.5
    ⚠ Should we stop? 6.5 ≮ 5.5, so NO!
  • 123. NRA example (step 12)
    • We move the cursors sequentially.
    Jacob’s inverted list (movie: predicted rating): r1: 1.5, r2: 1.5, r3: 1.25, r4: 1, r5: 1, r6: 0.8, r7: 0.75, r8: 0.75, r9: 0.5, r10: 0.5
    Julia’s inverted list (movie: predicted rating): r7: 5, r3: 4.5, r4: 4.5, r2: 4.25, r1: 3.5, r9: 3, r5: 2, r6: 1, r8: 0.5, r10: 0.5
    11 SAs have been performed so far.
    position  movie  worst-case score  best-case score
    Buffer
    1         r3     5.75              5.75
    2         r2     5.75              5.75
    3         r4     5.5               5.5
    Candidates
    4         r7     5                 6.5
    5         r1     5                 5
    min_k = 5.5
    Given worstcase(r6) = 0.8 and bestcase(r6) = 4.3, should r6 be added to the buffer?
    Given that worstcase(r6) ≯ min_k, NO!
    Can we still keep it as a candidate?
    Given that bestcase(r6) ≯ min_k, NO!
    ⚠ Should we stop? 6.5 ≮ 5.5, so NO!
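The worst-case and best-case scores quoted for r4, r5 and r6 can be reproduced with a short sketch. It assumes the aggregate score is the sum of the two predicted ratings, that SAs alternate between Jacob’s and Julia’s lists starting with Jacob, that an unread rating counts as 0 in the worst case, and that it is bounded by the current cursor value in the best case. The names `JACOB`, `JULIA` and `bounds_after` are hypothetical.

```python
JACOB = [("r1", 1.5), ("r2", 1.5), ("r3", 1.25), ("r4", 1.0), ("r5", 1.0),
         ("r6", 0.8), ("r7", 0.75), ("r8", 0.75), ("r9", 0.5), ("r10", 0.5)]
JULIA = [("r7", 5.0), ("r3", 4.5), ("r4", 4.5), ("r2", 4.25), ("r1", 3.5),
         ("r9", 3.0), ("r5", 2.0), ("r6", 1.0), ("r8", 0.5), ("r10", 0.5)]


def bounds_after(num_sa, lists=(JACOB, JULIA)):
    """Return {movie: (worst_case, best_case)} after num_sa round-robin SAs."""
    seen = {}                              # movie -> {list index: rating read}
    cursor = [float("inf")] * len(lists)   # last rating read from each list
    for step in range(num_sa):
        i = step % len(lists)              # alternate Jacob, Julia, Jacob, ...
        movie, rating = lists[i][step // len(lists)]
        seen.setdefault(movie, {})[i] = rating
        cursor[i] = rating
    bounds = {}
    for movie, ratings in seen.items():
        worst = sum(ratings.values())      # unread ratings count as 0
        # An unread rating cannot exceed the cursor of its (sorted) list.
        best = worst + sum(cursor[i] for i in range(len(lists)) if i not in ratings)
        bounds[movie] = (worst, best)
    return bounds


print(bounds_after(6)["r4"])   # bounds of r4 right after the 6th SA
print(bounds_after(9)["r5"])   # bounds of r5 right after the 9th SA
```

Because this sketch recomputes every best-case score from the current cursors, movies read earlier may get tighter best-case bounds than the values displayed in the slide tables.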
  • 126. NRA example (step 13)
    • We move the cursors sequentially.
    Jacob’s inverted list (movie: predicted rating): r1: 1.5, r2: 1.5, r3: 1.25, r4: 1, r5: 1, r6: 0.8, r7: 0.75, r8: 0.75, r9: 0.5, r10: 0.5
    Julia’s inverted list (movie: predicted rating): r7: 5, r3: 4.5, r4: 4.5, r2: 4.25, r1: 3.5, r9: 3, r5: 2, r6: 1, r8: 0.5, r10: 0.5
    12 SAs have been performed so far.
    position  movie  worst-case score  best-case score
    Buffer
    1         r3     5.75              5.75
    2         r2     5.75              5.75
    3         r4     5.5               5.5
    Candidates
    4         r7     5                 6.5
    5         r1     5                 5
    min_k = 5.5
    We add movie r9 to neither the buffer nor the candidates.
    ⚠ Should we stop? 6.5 ≮ 5.5, so NO!
  • 127. NRA example (step 14)
    • We move the cursors sequentially.
    Jacob’s inverted list (movie: predicted rating): r1: 1.5, r2: 1.5, r3: 1.25, r4: 1, r5: 1, r6: 0.8, r7: 0.75, r8: 0.75, r9: 0.5, r10: 0.5
    Julia’s inverted list (movie: predicted rating): r7: 5, r3: 4.5, r4: 4.5, r2: 4.25, r1: 3.5, r9: 3, r5: 2, r6: 1, r8: 0.5, r10: 0.5
    13 SAs have been performed so far.
    position  movie  worst-case score  best-case score
    Buffer
    1         r3     5.75              5.75
    2         r2     5.75              5.75
    3         r4     5.5               5.5
    Candidates
    4         r7     5                 6.5
    5         r1     5                 5
    min_k = 5.5
    ⚠ Should we stop? 6.5 ≮ 5.5, so NO!