Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside

TempWeb2024, 13 May 2024
Stefan Dietze
GESIS, HHU & HeiCAD Düsseldorf

What is behavioral web data?
[Figure; source: Domo via PCMag]

What is behavioral web data?
▪ Social web activity streams (posts, shares, likes, follows etc)
▪ Web search behaviour & SERP (Search Engine Result Pages) interactions
▪ Browsing and navigation behaviour
▪ Low-level behavioral traces (scrolling, mouse movements, gaze behavior etc)
▪ Hard to separate from actual Web content/pages
▪ But: closer to users & their personal (potentially sensitive) information

Why is it important?
▪ Reflects attitudes, leanings, cognitive states, biases
▪ Without understanding behavior, we cannot understand content / data it produces
▪ Majority of algorithms and models rely on behavioral data (e.g. clickthrough data for
ranking algorithms)…
▪ …or are substantially impacted by user behavior (e.g. LLMs trained on user-generated
content that in turn is driven by user interactions)
▪ Central to various research fields in CS concerned with information behavior:
interactive information retrieval, HCI, user modeling, Web mining, etc

Why is it important?
▪ Spawned entirely new research areas like Computational Social Science (CSS)

Overview
▪ Challenges of behavioral web data
▪ Case studies (collecting, sharing, analysis: data & methods)
o “Found” behavioral web data
o “Designed” behavioral web data
▪ Take-aways & outlook

Challenges: dependencies on 3rd party gatekeepers
Behavioral data is usually tied to specific
platforms, not distributed as the WWW

Challenges: volatility & decay of data
• Data is not persistent
• Example: deletion ratio of tweets between 25% and 29%
• The ratio differs between samples

Challenges: behavioral data is sensitive
Example: AOL release of 20 M search queries (2006)

Challenges: legal restrictions and ethical concerns
▪ Behavioral web data tends to involve sensitive information
▪ Ethical concerns, e.g., when information is taken out of context
▪ Various national and international laws (GDPR etc)
▪ Licensing / legal aspects: Twitter terms of service, copyright, etc.
▪ At the same time: right to archive / research enshrined in various national laws
▪ Different constraints for (a) archiving and (b) sharing / using data, as well as for different uses & users (e.g. archival institutions)
▪ Individual risk assessment per use case: What kind of data? For what purpose? By whom?

Found & designed web data for investigating (mis)information behavior
Range of research concerned with IR & CSS:
▪ Insights, e.g.:
− Understanding information interaction (e.g. during search)
− Spreading of claims and misinformation
− Effect of biased news/claims on public opinion
▪ Computational methods, e.g.:
− Crawling, harvesting, scraping of data
− Information retrieval & ranking
− Extraction of structured knowledge (entities, sentiments, stances, claims, etc.)
− Classification of search/navigation behavior or users
http://gesis.org/en/kts

Found behavioral web data
▪ Data that can be harvested via open APIs or
scraped from the public web over long time
periods and captures real-world online
interactions “found” in the wild
▪ Examples: social web posts/interactions, Twitter/X
data (specifically before API shutdown)
▪ Tends to include data that has been shared
voluntarily by online users, e.g. Twitter users
▪ But: users usually did not provide explicit consent
for secondary use of their data

Case study: Twitter/X
Motivation
Archival perspective:
▪ Ensure long-term archival of volatile information from Twitter
▪ Independence from third-party data access / APIs
Research perspective:
▪ Training and evaluating machine learning models (e.g., NER, classification)
▪ Large-scale analyses (e.g., language use, trends)
▪ Facilitate interdisciplinary research on societal online discourse
(e.g. political science, communication science, psychology, sociology)
→ Goal: capture a representative sample of all Twitter data

Why real-time collection & preservation of Twitter/X data?
Data decay
▪ Approx. 28% of tweets are deleted over time
▪ Power-law distribution: the vast majority of deleted tweets are deleted by a small number of users
▪ Prevalent biases in deleted vs. non-deleted data: anti-science, conservative and hard-line views are more frequent in deleted tweets
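
The deletion ratio and its concentration on a few users can be estimated by re-checking an archived sample against what is still online. The following is a minimal sketch, not the procedure used in the talk: the mapping of tweet IDs to user IDs, the set of still-available IDs and the function name `deletion_stats` are all assumptions made for illustration.

```python
from collections import Counter

def deletion_stats(collected, still_available, top_share=0.1):
    """Estimate data decay for an archived sample.

    collected: dict mapping tweet_id -> user_id (hypothetical archive sample).
    still_available: set of tweet_ids found online in a later re-check.
    Returns (deletion_ratio, share_of_deletions_by_top_users), where the
    second value is the share of deleted tweets that belong to the
    `top_share` fraction of deleting users with the most deletions.
    """
    deleted = [tid for tid in collected if tid not in still_available]
    ratio = len(deleted) / len(collected) if collected else 0.0
    per_user = Counter(collected[tid] for tid in deleted)
    if not per_user:
        return ratio, 0.0
    # Deletion counts per user, heaviest deleters first.
    counts = sorted(per_user.values(), reverse=True)
    k = max(1, round(len(counts) * top_share))
    concentration = sum(counts[:k]) / len(deleted)
    return ratio, concentration

# Toy data (invented): user "a" deleted three tweets, user "b" one.
sample = {1: "a", 2: "a", 3: "a", 4: "b", 5: "c",
          6: "d", 7: "e", 8: "f", 9: "g", 10: "h"}
online = {5, 6, 7, 8, 9, 10}
ratio, concentration = deletion_stats(sample, online, top_share=0.5)
print(ratio, concentration)
```

A concentration close to 1 for a small `top_share` would be in line with the power-law pattern described above.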

Why real-time collection & preservation of Twitter/X data?
Model decay due to evolving language & vocabulary
▪ Models & LLMs are trained on large volumes of text
▪ Yet: strong vocabulary shift, over-/underrepresentation of topics/vocabulary in particular time periods (e.g. Twitter COVID-19 discourse 2020 vs 2019)
▪ LLMs for online discourse analysis require frequent training and updates (and continuous access to data)
Source: Hombaiah et al., “Dynamic Language Models for Continuously Evolving Content”, SIGKDD 2021

Redundant crawls of 1% Twitter stream via Firehose API
▪ 14 billion tweets collected between 04/2013 and 05/2023
▪ Largest continuous tweet archive for research purposes
▪ Legal, ethical and licensing constraints (Twitter ToS)
▪ Data sharing via:
o Sensitive data access: facilitating on-prem research on data (e.g. online/offline secure data centers) or contract-based sharing of sensitive data
o Public, non-sensitive data offers: creating non-sensitive derivatives from raw data to facilitate research
TweetsKB – a non-sensitive large-scale archive of societal discourse
https://data.gesis.org/tweetskb
▪ Subset of 3 billion prefiltered tweets (English, spam detection through pretrained classifier)
▪ Sharing of tweet metadata (timestamps, retweet counts, etc.), hashtags, user mentions and dedicated features that capture tweet semantics (no actual user IDs or full texts)
▪ Features include [CIKM2020, CIKM2022]:
o Disambiguated mentions of entities, linked to Wikipedia/DBpedia (“president”/“potus”/“trump” => dbp:DonaldTrump)
o Sentiment scores (positive/negative emotions)
o Geotags via pretrained DeepGeo model
o Science references/claims [CIKM2022]

Feature      Total            Unique         % with >= 1 feature
Hashtags     1,161,839,471    68,832,205     0.19
Mentions     1,840,456,543    149,277,474    0.38
Entities     2,563,433,997    2,265,201      0.56
Sentiment    1,265,974,641    -              0.5

Dimitrov, D., Fafalios, P., Yu, R., Zhu, X., Zloch, M., Dietze, S., TweetsCOV19 – A KB of Semantically Annotated Tweets about the COVID-19 Pandemic, CIKM2020
Hafid, S., Schellhammer, S., Bringay, S., Todorov, K., Dietze, S., SciTweets - A Dataset and Annotation Framework for Detecting Scientific Online Discourse, CIKM2022
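
Creating a non-sensitive derivative from a raw tweet can be sketched as follows. This is an illustrative assumption, not the TweetsKB pipeline: the raw record layout, the field names, the salted-hash pseudonym and the function name `to_non_sensitive` are all hypothetical. The sketch only shows the principle of keeping metadata, hashtags and mentions while dropping the full text and the actual user ID.

```python
import hashlib
import re

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")

def to_non_sensitive(tweet, salt):
    """Return a derivative record without full text or actual user ID.

    tweet: dict with the (assumed) keys id, user_id, created_at,
    retweet_count and text.
    salt: secret string kept by the archive; used so that the pseudonym
    cannot be recomputed from a public user ID alone (an assumption of
    this sketch, not a documented TweetsKB design).
    """
    text = tweet["text"]
    digest = hashlib.sha256((salt + tweet["user_id"]).encode("utf-8"))
    return {
        "tweet_id": tweet["id"],
        "user": digest.hexdigest()[:16],   # pseudonym instead of user ID
        "created_at": tweet["created_at"],
        "retweet_count": tweet["retweet_count"],
        "hashtags": [h.lower() for h in HASHTAG.findall(text)],
        "mentions": MENTION.findall(text),
    }

# Invented example record.
raw = {
    "id": "1",
    "user_id": "42",
    "created_at": "2020-03-15T10:00:00Z",
    "retweet_count": 3,
    "text": "Stay safe @WHO #COVID19 #StayHome",
}
record = to_non_sensitive(raw, salt="demo-salt")
print(record)
```

Semantic features such as linked entities or sentiment scores would be added to the record by separate models, which are out of scope for this sketch.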

TweetsKB – knowledge graph schema & data access
https://data.gesis.org/tweetskb
Data access via:
▪ SPARQL endpoint/REST API for demos
▪ Download of data dumps (Zenodo, SDN Datorium)
▪ So far approx. 30 K downloads
Dimitrov, D., Fafalios, P., Yu, R., Zhu, X., Zloch, M., Dietze, S., TweetsCOV19 – A KB of Semantically Annotated Tweets about the COVID-19 Pandemic, CIKM2020

TweetsKB as social science research corpus
Investigating vaccine hesitancy in DACH countries
[Figure: Twitter discourse on “Impfbereitschaft” / “Vaccination hesitancy”; annotated event: Germany suspends vaccinations with AstraZeneca]
https://dd4p.gesis.org/
Boland, K. et al., Data for policy-making in times of crisis - a computational analysis of German online discourses about COVID-19 vaccinations, JMIR, under review

Case: Telegram
▪ Telegram channels: public, only admin can post (as opposed to private groups)
▪ Decentralised: no registry of channels available
▪ Continuous data collection of currently 400 K channels through snowball sampling (300 seed channels)
▪ Full message history collected for > 10 K channels, approx. 100 M messages so far
▪ Telegram cross-channel message passing dataset extracted to support information spreading research, e.g. on mis- and disinformation, hate speech, etc.
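
Snowball sampling from seed channels can be sketched as a breadth-first traversal. This is a minimal sketch under stated assumptions: `get_linked_channels` is a hypothetical callback that returns the channels referenced by a given channel (for instance via forwarded messages); it stands in for whatever collection client is actually used and is not a real Telegram API call.

```python
from collections import deque

def snowball(seeds, get_linked_channels, max_channels=1000):
    """Discover channels breadth-first, starting from seed channels.

    seeds: iterable of seed channel names.
    get_linked_channels: hypothetical callable, channel -> iterable of
    channels it links to (e.g. sources of forwarded messages).
    max_channels: stop discovering new channels once this many are known.
    """
    seen = set(seeds)
    queue = deque(seen)
    while queue and len(seen) < max_channels:
        channel = queue.popleft()
        for linked in get_linked_channels(channel):
            if linked not in seen and len(seen) < max_channels:
                seen.add(linked)
                queue.append(linked)
    return seen

# Toy link graph (invented) in place of real channel data.
toy_graph = {"seed1": ["a", "b"], "a": ["c"], "b": ["a"], "c": []}
discovered = snowball(["seed1"], lambda ch: toy_graph.get(ch, []))
print(sorted(discovered))
```

Because no registry of channels exists, the result depends on the choice of seeds; channels that are never linked from the seed neighbourhood stay undiscovered.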

Understanding claims & misinformation on the Web: ClaimsKG
https://data.gesis.org/claimskg/
Motivation
▪ Claims spread across various (unstructured) fact-checking sites
▪ Claims and truth ratings evolve over time
▪ Finding claims is hard: e.g. claims about / made by US Republican politicians across the Web?
Approach
▪ Continuous harvesting of claims & metadata from fact-checking sites (e.g. snopes.com, Politifact.com, etc.); currently approx. 75,000 claims since 2019
▪ Feature extraction & linking:
o Mentioned entities
o Joint topic classification
o Normalisation of ratings (true, false, mixture, other); coreference resolution of claims
o Exposing data through established vocabulary and W3C standards (e.g. SPARQL endpoint)
A. Tchechmedjiev, P. Fafalios, K. Boland, S. Dietze, B. Zapilko, K. Todorov, ClaimsKG – A Live Knowledge Graph of fact-checked Claims, ISWC2019
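
Normalising site-specific truth ratings into the four classes named above (true, false, mixture, other) can be sketched with a lookup table. The mapping entries below are hypothetical examples chosen for illustration; they are not the mapping used by ClaimsKG.

```python
# Hypothetical mapping from site-specific labels to normalised ratings.
RATING_MAP = {
    "true": "true",
    "correct": "true",
    "false": "false",
    "pants on fire": "false",
    "mostly true": "mixture",
    "half true": "mixture",
    "mostly false": "mixture",
    "mixture": "mixture",
}

def normalise_rating(raw_label):
    """Map a raw fact-checker label to true / false / mixture / other.

    Labels are lower-cased, hyphens are treated as spaces and repeated
    whitespace is collapsed before the lookup. Unknown labels fall back
    to "other".
    """
    key = " ".join(raw_label.lower().replace("-", " ").split())
    return RATING_MAP.get(key, "other")

for label in ["Pants on Fire", "Half-True", "TRUE", "Unproven"]:
    print(label, "->", normalise_rating(label))
```

A fallback class keeps the pipeline robust when a fact-checking site introduces a label that the table does not cover yet.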

Evolution of claims: frequency & topics
S. Gangopadhyay et al., Investigating Characteristics, Biases and Evolution of Fact-Checked Claims on the Web, ACM Hypertext 2024 (under review)

Evolution of claims: topic biases of fact-check sources
S. Gangopadhyay et al., Investigating Characteristics, Biases and Evolution of Fact-Checked Claims on the Web, ACM Hypertext 2024 (under review)

Stances towards claims / fake news in social media
Motivation
▪ Problem: detecting the stance of documents (e.g. social media posts) towards a given claim (unbalanced class distribution)
▪ Motivation: stance of documents (in particular disagreement) is useful (a) as a signal for truthfulness (fake news detection) and (b) for document or source classification (e.g. users)
Approach
▪ Cascading binary classifiers: addressing individual issues (e.g. misclassification costs) per step
▪ Features, e.g. textual similarity (Word2Vec etc.), sentiments, LIWC, etc.
▪ Best-performing models per step: 1) SVM with class-wise penalty, 2) CNN, 3) SVM with class-wise penalty
▪ Experiments on FNC-1 dataset (and FNC baselines)
Results
▪ Minor overall performance improvement
▪ Improvement on disagree class by 27% (but still far from robust)
A., Fafalios, P., Ekbal, A., Zhu, X., Dietze, S., Exploiting stance hierarchies for cost-sensitive stance detection of Web documents, J Intell. Inf. Syst. 58(1), 1-19 (2022)
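
The idea of cascading binary classifiers can be sketched as a chain of three yes/no decisions that together yield one of the four FNC-1 stance labels. The ordering of the steps (related vs. unrelated, then neutral vs. opinionated, then agree vs. disagree) and the toy threshold rules below are assumptions for illustration; in practice each step would be a trained model such as an SVM or CNN.

```python
def cascade_predict(doc, is_related, is_opinionated, is_disagree):
    """Classify a document's stance with three binary decisions.

    Each argument after `doc` is a callable returning True/False and
    stands in for one trained binary classifier of the cascade.
    """
    if not is_related(doc):
        return "unrelated"
    if not is_opinionated(doc):
        return "discuss"
    return "disagree" if is_disagree(doc) else "agree"

# Toy stand-ins (hypothetical features and thresholds, not trained models).
def related(doc):
    return doc["similarity"] >= 0.3

def opinionated(doc):
    return abs(doc["sentiment"]) >= 0.2

def disagree(doc):
    return doc["sentiment"] < 0

docs = [
    {"similarity": 0.1, "sentiment": -0.9},
    {"similarity": 0.8, "sentiment": 0.05},
    {"similarity": 0.8, "sentiment": -0.6},
    {"similarity": 0.8, "sentiment": 0.7},
]
labels = [cascade_predict(d, related, opinionated, disagree) for d in docs]
print(labels)
```

Splitting the task this way lets each step use its own model and its own misclassification costs, which matters most for the rare disagree class.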

Wrap-up: found data
Archival/collection
▪ Easy (assuming gatekeepers' goodwill), even over long time periods (TweetsKB: 10 years)
▪ Public APIs, screen-scraping, crawling
Analysis
▪ Heterogeneity and scale of data (e.g. tweets, query logs)
▪ Feature extraction (stances, topics, emotions, etc.) across entire corpus is challenging
▪ Specific research questions usually require dedicated models (no one-size-fits-all approach)
Sharing
▪ Strict constraints (legal, ethical, licensing)
▪ Scalable sharing of sensitive data is still an unsolved problem

Designed behavioral web data to the rescue
▪ Goal: obtain sharable and easy-to-interpret behavioral web data through experimental lab studies & quasi-experiments
▪ Typically involves:
− Artificial settings (e.g. labs)
− Simulation of real-world online scenarios (e.g. web search)
− Usually less sensitive data
− Full consent of participants about data collection & sharing intentions
− Short time intervals
− Small-scale data (due to costly process)

Case: web search behavior (SAL = “Search As Learning”)
Research challenges at the intersection of AI/ML, HCI & cognitive psychology
▪ Detecting coherent search missions?
▪ Detecting learning throughout search? Detecting “informational” search missions (as opposed to “transactional” or “navigational” missions)
▪ How competent is the user? Predict/understand knowledge state of users based on in-session behavior/interactions
▪ How well does a user achieve their learning goal/information need? Predict knowledge gain throughout search session
Hoppe, A., Holtz, P., Kammerer, Y., Yu, R., Dietze, S., Ewerth, R., Current Challenges for Studying Search as Learning Processes, 7th Workshop on Learning & Education with Web Data (LILE2018), in conjunction with ACM Web Science 2018 (WebSci18), Amsterdam, NL, 27 May, 2018.

Data collection for understanding knowledge gain/state of users
Data collection - summary
▪ Crowdsourced collection of search session data
▪ 10 search topics (e.g. “Altitude sickness”, “Tornados”), incl. pre- and post-tests to assess user knowledge
▪ Approx. 1000 distinct crowd workers & 100 sessions per topic
▪ Tracking of user behavior through 76 features in 5 categories (session, query, SERP (search engine result page), browsing, mouse traces)
Gadiraju, U., Yu, R., Dietze, S., Holtz, P., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM CHIIR 2018.

Understanding knowledge gain/state of users during web search
Some results
▪ 70% of users exhibited a knowledge gain (KG)
▪ Negative relationship between KG of users and topic popularity (avg. accuracy of workers in knowledge tests) (R = −0.87)
▪ Amount of time users actively spent on web pages explains 7% of the variance in their KG
▪ Query complexity explains 25% of the variance in the KG of users
▪ Topic-dependent behavior: search behavior correlates more strongly with search topic than with KG/KS
Gadiraju, U., Yu, R., Dietze, S., Holtz, P., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM CHIIR 2018.
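
Statements of the form "X explains N% of the variance" and correlation values such as R = −0.87 are related: for a single predictor, the share of variance explained by a linear fit equals the squared Pearson correlation. The sketch below computes both on invented toy numbers; the data and variable names are assumptions and do not reproduce the study's measurements.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Invented toy values: topic popularity (avg. pre-test accuracy) vs. KG.
popularity = [0.2, 0.4, 0.5, 0.7, 0.9]
knowledge_gain = [0.35, 0.30, 0.22, 0.15, 0.05]

r = pearson_r(popularity, knowledge_gain)
variance_explained = r ** 2   # valid for a single linear predictor
print(round(r, 3), round(variance_explained, 3))
```

Under the same single-predictor assumption, a correlation of −0.87 would correspond to roughly 76% of variance explained, since (−0.87)² ≈ 0.757.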

Predicting knowledge gain/state during web search
▪ Same session data as Gadiraju et al., 2018
▪ Stratification of users by knowledge state (KS) and knowledge gain (KG) into classes {low, moderate, high}: low below mean − 0.5 SD, high above mean + 0.5 SD, moderate in between
▪ Supervised multiclass classification (Naive Bayes, logistic regression, SVM, random forest, multilayer perceptron)
▪ Considers in-session features (behavioural traces) only
▪ [Figure: KG prediction performance results (after 10-fold cross-validation)]
Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Predicting User Knowledge Gain in Informational Search Sessions. ACM SIGIR 2018.
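
The stratification into {low, moderate, high} around mean ± 0.5 SD can be sketched as follows. The function name, the use of the sample standard deviation and the toy scores are assumptions of this sketch; the source only states the mean ± 0.5 SD rule.

```python
import statistics

def stratify(scores, width=0.5):
    """Label each score as low, moderate or high.

    low:      score < mean - width * SD
    high:     score > mean + width * SD
    moderate: everything in between
    Uses the sample standard deviation (an assumption of this sketch).
    """
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    lower = mean - width * sd
    upper = mean + width * sd
    labels = []
    for score in scores:
        if score < lower:
            labels.append("low")
        elif score > upper:
            labels.append("high")
        else:
            labels.append("moderate")
    return labels

# Invented knowledge-gain scores (e.g. post-test minus pre-test points).
kg_scores = [0, 10, 20, 30, 40]
kg_classes = stratify(kg_scores)
print(list(zip(kg_scores, kg_classes)))
```

The resulting class labels would then serve as targets for the supervised multiclass classifiers listed above.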

Predicting knowledge gain/state during SAL: Features
[Figure: behavioral features]

[Figure: feature importance (knowledge gain prediction task)]

[Figure: feature importance (knowledge state prediction task)]

Gaze data as additional source of behavioral data in SAL
▪ Eye gaze data (word-, sentence-, or HTML structure-level) as additional source of behavioral data
▪ Various studies in SAL context and beyond to understand topic familiarity, knowledge & competence or comprehension issues
▪ Usually small study sizes (e.g. 25 < N < 150)
▪ Costly but highly informative features
Davari, M., Yu, R., Dietze, S., Understanding the Influence of Topic Familiarity on Search Behavior in Digital Libraries, EARS 2019 – International Workshop on ExplainAble Recommendation and Search, @ SIGIR2019, 2019.
Otto, C., Yu, R., Pardi, G., von Hoyer, J., Rokicki, M., Hoppe, A., Holtz, P., Kammerer, Y., Dietze, S., Ewerth, R., Predicting Knowledge Gain during Web Search based on Multimedia Resource Consumption, 22nd International Conference on Artificial Intelligence in Education (AIED2021), 2021

Facilitating SAL research through public research data
https://data.uni-hannover.de/dataset/sal-dataset
Otto, C., Rokicki, M., Pardi, G., Gritz, W., Hienert, D., Yu, R., Hoyer, J., Hoppe, A., Dietze, S., Holtz, P., Kammerer, Y., Ewerth, R., SaL-Lightning Dataset: Search and Eye Gaze Behavior, Resource Interactions and Knowledge Gain during Web Search, ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR2022).

Case: crowd worker behavior in microtask crowdsourcing
▪ Context: online crowdsourcing tasks widely used to collect data
▪ Research question: can we classify different worker types (and detect competent workers) from behavioral traces alone (mouse movements, scrolling, keystrokes, etc.)?
▪ Various studies in experimental conditions capturing a wide range of features in various tasks
▪ Low-level behavioural features are highly informative when predicting worker competence and output quality
[Figure: worker types “Fast Deceiver” and “Competent Worker”]
Gadiraju, U., Kawase, R., Dietze, S., Demartini, G., Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online Surveys. ACM CHI2015
Gadiraju, U., Demartini, G., Kawase, R., Dietze, S., Crowd Anatomy Beyond the Good and Bad: Behavioral Traces for Crowd Worker Modeling and Pre-selection, Computer Supported Cooperative Work 28(5): 815-841 (2019), Springer, 2019.

Wrap-up: found vs designed behavioral data

             FOUND DATA                               DESIGNED DATA
Collection   As long as gatekeepers allow             Costly experimental data collection
             crawling / scraping
Analysis     Large & heterogeneous data;              Homogeneous, small-scale data;
             long time intervals;                     short time intervals;
             no “one-size-fits-all” methods           limited use cases
Sharing      Sensitive information;                   Full consent of participants;
             ethical, legal, licensing constraints    little sensitive information due to
                                                      artificial tasks

Key take-aways
▪ Behavioral web data: crucial ingredient for a wide range of research across various disciplines
▪ Found data: crucial to archive to ensure long-term access; sharing is hard due to sensitive information
▪ Designed data: collection is costly; limited scale and scope of data
▪ Access to behavioral web data remains a challenge => ongoing & future work @ KTS/GESIS on
− infrastructures for collecting experimental data (e.g. in web search)
− infrastructures for data access (e.g. for tweet archives)
− non-sensitive data offers to enable reuse of sensitive found data (e.g. TweetsKB)

@stefandietze
https://stefandietze.net
http://gesis.org/en/kts
