Collecting and Temporal Analysis of Behavioral Web Data - Tales from the Inside
TempWeb2024, 13 May 2024
Stefan Dietze
GESIS, HHU & HeiCAD Düsseldorf
What is behavioral web data?
Source: Domo via PCMag
What is behavioral web data?
▪ Social web activity streams (posts, shares, likes, follows etc)
▪ Web search behaviour & SERP (Search Engine Result Pages) interactions
▪ Browsing and navigation behaviour
▪ Low-level behavioral traces (scrolling, mouse movements, gaze behavior etc)
▪ Hard to separate from actual Web content/pages
▪ But: closer to users & their personal (potentially sensitive) information
Why is it important?
▪ Reflects attitudes, leanings, cognitive states, biases
▪ Without understanding behavior, we cannot understand content / data it produces
▪ Majority of algorithms and models rely on behavioral data (e.g. clickthrough data for
ranking algorithms)…
▪ …or are substantially impacted by user behavior (e.g. LLMs trained on user-generated
content that in turn is driven by user interactions)
▪ Central to various research fields in CS concerned with information behavior:
interactive information retrieval, HCI, user modeling, Web mining, etc
Why is it important?
▪ Spawned entirely new research
areas like Computational Social
Science (CSS)
Overview
▪ Challenges of behavioral web data
▪ Case studies (collecting, sharing, analysis: data & methods)
o “Found” behavioral web data
o “Designed” behavioral web data
▪ Take-aways & outlook
Challenges: dependencies on 3rd party gatekeepers
Behavioral data is usually tied to specific platforms, not distributed like the WWW
Challenges: volatility & decay of data
• Data is not persistent
• Example: deletion ratio of tweets between 25% and 29%
• Ratio differs between samples
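A deletion ratio like the one above can be estimated by re-checking which collected tweet IDs are still retrievable. The following is a minimal sketch under stated assumptions: the function name `deletion_ratio` and the toy ID ranges are hypothetical, and the numbers are illustrative rather than taken from the studies behind the slide.

```python
def deletion_ratio(collected_ids, available_ids):
    """Share of collected IDs that can no longer be retrieved."""
    collected = set(collected_ids)
    if not collected:
        raise ValueError("sample is empty")
    deleted = collected - set(available_ids)
    return len(deleted) / len(collected)


# Illustrative toy samples (not real data): IDs collected vs. IDs still online.
sample_a = deletion_ratio(range(0, 100), range(0, 75))   # 25 of 100 deleted
sample_b = deletion_ratio(range(0, 200), range(0, 142))  # 58 of 200 deleted
print(f"sample A: {sample_a:.0%}, sample B: {sample_b:.0%}")
```

Computing the ratio per sample makes the variation between samples directly visible.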
Challenges: behavioral data is sensitive
Example: AOL release of 20 M search queries (2006)
Challenges: legal restrictions and ethical concerns
▪ Behavioral web data tends to involve sensitive information
▪ Ethical concerns, e.g., when information is taken out of context
▪ Various national and international laws (GDPR etc)
▪ Licensing / legal aspects: Twitter terms of service, copyright, etc.
▪ At the same time: right to archive / research wired into various national legislations
▪ Different constraints for (a) archiving and (b) sharing / using data as well as for
different uses & users (e.g. archival institutions)
▪ Individual risk assessment per use case: What (kind of data)? For what purpose? By whom?
Overview
▪ Challenges of behavioral web data
▪ Case studies (collecting, sharing, analysis: data and methods)
o “Found” behavioral web data
o “Designed” behavioral web data
▪ Take-aways & outlook
Found & designed web data for investigating (mis)information behavior
Range of research concerned with IR & CSS:
▪ Insights, e.g.:
− Understanding information interaction (e.g. during search)
− Spreading of claims and misinformation
− Effect of biased news/claims on public opinion
▪ Computational Methods, e.g.:
− Crawling, harvesting, scraping of data
− Information retrieval & ranking
− Extraction of structured knowledge (entities, sentiments, stances, claims, etc)
− Classification of search/navigation behavior or users
http://gesis.org/en/kts
Found behavioral web data
▪ Data that can be harvested via open APIs or
scraped from the public web over long time
periods and captures real-world online
interactions “found” in the wild
▪ Examples: social web posts/interactions, Twitter/X data (specifically before API shutdown)
▪ Tends to include data that has been shared
voluntarily by online users, e.g. Twitter users
▪ But: users usually did not provide explicit consent
for secondary use of their data
Case study: Twitter/X
Motivation
Archival perspective:
▪ Ensure long-term archival of volatile information from Twitter
▪ Independence from third-party data access / APIs
Research perspective:
▪ Training and evaluating machine learning models (e.g., NER, classification)
▪ Large-scale analyses (e.g., language use, trends)
▪ Facilitate interdisciplinary research on societal online discourse
(e.g. political science, communication science, psychology, sociology)
→ Goal: capture a representative sample of all Twitter data
Why real-time collection & preservation of Twitter/X data?
Data decay
▪ Approx. 28% of tweets deleted over time
▪ Power law distribution: vast majority of tweets are deleted by a small number of users
▪ Prevalent biases in deleted/non-deleted data: anti-science, conservative and hard-line views more frequent in deleted tweets
Why real-time collection & preservation of Twitter/X data?
Model decay due to evolving language & vocabulary
▪ Models & LLMs trained on large volumes of text
▪ Yet: strong vocabulary shift, over-/underrepresentation of topics/vocabulary in particular time periods (e.g. Twitter COVID-19 discourse 2020 vs 2019)
▪ LLMs for online discourse analysis require
frequent training and updates (and continuous
access to data)
Source: Hombaiah et al., “Dynamic Language Models for continuously evolving Content”, SIGKDD2021
Redundant crawls of 1% Twitter stream via Firehose API
▪ 14 billion tweets collected between 04/2013 – 05/2023
▪ Largest continuous tweet archive for research purposes
▪ Legal, ethical and licensing constraints (Twitter ToS)
▪ Data sharing via:
o Sensitive data access: facilitating on-prem research on data (e.g. online/offline
secure data centers) or contract-based sharing of sensitive data
o Public, non-sensitive data offers: creating non-sensitive derivatives from raw data to
facilitate research
TweetsKB – a non-sensitive large-scale archive of societal discourse
▪ Subset of 3 billion prefiltered tweets
(English, spam detection through pretrained classifier)
▪ Sharing of tweet metadata (time stamps, retweet
counts etc), hash tags, user mentions and dedicated
features that capture tweet semantics
(no actual user IDs and full texts)
▪ Features include [CIKM2020, CIKM2022]:
o Disambiguated mentions of entities, linked to
Wikipedia/Dbpedia
(“president”/“potus”/“trump” => dbp:DonaldTrump)
o Sentiment scores (positive/negative emotions)
o Geotags via pretrained DeepGeo model
o Science references/claims [CIKM2022]
https://data.gesis.org/tweetskb
Feature     Total           Unique        Share of tweets with >= 1 feature
Hashtags    1,161,839,471   68,832,205    0.19
Mentions    1,840,456,543   149,277,474   0.38
Entities    2,563,433,997   2,265,201     0.56
Sentiment   1,265,974,641   -             0.5
Dimitrov, D., Fafalios, P., Yu, R., Zhu, X., Zloch, M., Dietze, S., TweetsCOV19 – A KB of Semantically Annotated Tweets about the COVID-19 Pandemic, CIKM2020
Hafid, S., Schellhammer, S., Bringay, S., Todorov, K., Dietze, S., SciTweets - A Dataset and Annotation Framework for Detecting Scientific Online Discourse, CIKM2022
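The idea of a non-sensitive derivative (metadata and extracted features, but no full text and no actual user ID) can be sketched as follows. This is a hedged illustration, not the TweetsKB pipeline: the raw record layout, the function `to_non_sensitive`, the regular expressions and the salted-hash pseudonym are all assumptions made for this example, and entity linking and sentiment scoring are left out.

```python
import hashlib
import re

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")


def to_non_sensitive(raw, salt):
    """Keep metadata and extracted features; drop full text and the raw user ID."""
    text = raw["text"]
    digest = hashlib.sha256((salt + raw["user_id"]).encode("utf-8")).hexdigest()
    return {
        "created_at": raw["created_at"],
        "retweet_count": raw["retweet_count"],
        "hashtags": [h.lower() for h in HASHTAG.findall(text)],
        "mentions": [m.lower() for m in MENTION.findall(text)],
        # Salted hash instead of the actual user ID (an assumption of this sketch).
        "user_pseudonym": digest[:16],
    }


# Hypothetical raw record; field names are illustrative only.
raw_tweet = {
    "user_id": "12345",
    "text": "Trying out #TweetsKB with @gesis_org",
    "created_at": "2020-03-01T12:00:00Z",
    "retweet_count": 3,
}
record = to_non_sensitive(raw_tweet, salt="demo-salt")
print(record["hashtags"], record["mentions"])
```

Whether such a derivative is actually non-sensitive still depends on the individual risk assessment per use case.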
TweetsKB – knowledge graph schema & data access
Data access via:
▪ SPARQL endpoint/REST API for demos
▪ Download of data dumps (Zenodo, SDN Datorium)
▪ So far approx. 30 K downloads
TweetsKB as social science research corpus
Investigating vaccine hesitancy in DACH countries
[Figure: Twitter discourse on “Impfbereitschaft” (“vaccination hesitancy”); annotated event: Germany suspends vaccinations with AstraZeneca]
https://dd4p.gesis.org/
Boland, K. et al., Data for policy-making in times of crisis - a computational analysis of German online discourses about COVID-19 vaccinations, JMIR, under review
Case: Telegram
▪ Telegram channels: public, only admin can post (as opposed to
private groups)
▪ Decentralised: no registry of channels available
▪ Continuous data collection of currently 400 K channels through
snowball sampling (300 seed channels)
▪ Full message history collected for > 10 K channels, approx. 100 M
messages so far
▪ Telegram cross-channel message passing dataset extracted to
support information spreading research, i.e., mis- and
disinformation, hate speech etc
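Since no registry of channels exists, snowball sampling discovers new channels by following links out of already known ones. A minimal breadth-first sketch, assuming a hypothetical callback `get_linked_channels` that returns the channels referenced (e.g. via forwarded messages) by a given channel; the toy link graph stands in for real Telegram data and no Telegram API is called.

```python
from collections import deque


def snowball_sample(seeds, get_linked_channels, max_channels):
    """Breadth-first discovery of channels starting from seed channels."""
    discovered = set(seeds)
    queue = deque(seeds)
    while queue and len(discovered) < max_channels:
        channel = queue.popleft()
        for linked in get_linked_channels(channel):
            if linked not in discovered and len(discovered) < max_channels:
                discovered.add(linked)
                queue.append(linked)
    return discovered


# Toy link graph standing in for forwards/mentions between channels.
TOY_LINKS = {
    "seed_a": ["news_1", "news_2"],
    "seed_b": ["news_2", "chat_3"],
    "news_1": ["chat_4"],
    "chat_3": ["seed_a"],
}

found = snowball_sample(
    ["seed_a", "seed_b"], lambda c: TOY_LINKS.get(c, []), max_channels=10
)
print(sorted(found))
```

The `max_channels` cap is a design choice of this sketch; a continuous collection would instead keep the queue and resume from it.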
Understanding claims & misinformation on the Web: ClaimsKG
Motivation
▪ Claims spread across various (unstructured) fact-checking
sites
▪ Claims and truth ratings evolve over time
▪ Finding claims is hard: e.g. claims about / made by US Republican politicians across the Web?
Approach
▪ Continuous harvesting of claims & metadata from fact-checking sites (e.g. snopes.com, Politifact.com etc); currently approx. 75,000 claims since 2019
▪ Feature extraction & linking:
o Mentioned entities
o Joint topic classification
o Normalisation of ratings (true, false, mixture, other);
coreference resolution of claims
o Exposing data through established vocabulary and W3C
standards
(e.g. SPARQL endpoint)
https://data.gesis.org/claimskg/
A. Tchechmedjiev, P. Fafalios, K. Boland, S. Dietze, B. Zapilko, K. Todorov, ClaimsKG – A Live Knowledge Graph of fact-checked Claims, ISWC2019
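The normalisation of ratings onto the four categories (true, false, mixture, other) can be pictured as a lookup from site-specific labels to a shared scale. The sketch below is illustrative only: the entries in `LABEL_MAP` are hypothetical examples and do not reproduce the mapping used in ClaimsKG.

```python
NORMALISED = ("true", "false", "mixture", "other")

# Hypothetical site-specific labels; real fact-checking sites use their own scales.
LABEL_MAP = {
    "true": "true",
    "correct": "true",
    "false": "false",
    "incorrect": "false",
    "pants on fire": "false",
    "half true": "mixture",
    "partly false": "mixture",
    "mixture": "mixture",
}


def normalise_rating(raw_label):
    """Map a raw rating label to one of the four normalised categories."""
    key = " ".join(raw_label.lower().replace("-", " ").split()).strip("!. ")
    # Anything not covered by the mapping falls back to "other".
    return LABEL_MAP.get(key, "other")


for label in ["Half-True", "Pants on Fire!", "Unproven", "TRUE"]:
    print(label, "->", normalise_rating(label))
```

Falling back to "other" keeps unknown labels visible instead of silently forcing them into true or false.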
Evolution of claims: frequency & topics
https://data.gesis.org/claimskg/
S. Gangopadhay et al., Investigating Characteristics, Biases and Evolution of Fact-Checked Claims on the Web, ACM Hypertext 2024 (under review)
Evolution of claims: topic biases of fact-check sources
Stances towards claims / fake news in social media
Motivation
▪ Problem: detecting stance of documents (e.g. social media posts)
towards a given claim (unbalanced class distribution)
▪ Motivation: stance of documents (in particular disagreement) useful (a) as signal for truthfulness (fake news detection) and (b) for document or source classification (e.g. users)
Approach
▪ Cascading binary classifiers: addressing individual issues (e.g.
misclassification costs) per step
▪ Features, e.g. textual similarity (Word2Vec etc), sentiments, LIWC,
etc.
▪ Best-performing models: 1) SVM with class-wise penalty, 2) CNN, 3)
SVM with class-wise penalty
▪ Experiments on FNC-1 dataset (and FNC baselines)
Results
▪ Minor overall performance improvement
▪ Improvement on disagree class by 27%
(but still far from robust)
A., Fafalios, P., Ekbal, A., Zhu, X., Dietze, S., Exploiting stance hierarchies for cost-sensitive stance detection of Web documents, J Intell. Inf. Syst. 58(1), 1-19 (2022)
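The cascading idea can be sketched as a chain of binary decisions in which each stage either emits a final label or hands the document to the next stage. The stage order used here (related vs. unrelated, then stance vs. discuss, then agree vs. disagree) and the keyword rules are assumptions for illustration; in the study each step is a trained classifier (e.g. an SVM with class-wise penalty), not a word list.

```python
DENIAL_CUES = {"false", "hoax", "debunked", "not"}
STANCE_CUES = DENIAL_CUES | {"confirmed", "true", "indeed"}


def tokens(text):
    return set(text.lower().split())


def related_stage(pair):
    claim, body = pair
    # Toy relatedness test: any shared token between claim and document.
    return None if tokens(claim) & tokens(body) else "unrelated"


def stance_stage(pair):
    _, body = pair
    return None if tokens(body) & STANCE_CUES else "discuss"


def polarity_stage(pair):
    _, body = pair
    return "disagree" if tokens(body) & DENIAL_CUES else "agree"


def cascade_predict(pair, stages):
    """Run binary stages in order; stop at the first stage that emits a label."""
    for decide in stages:
        label = decide(pair)
        if label is not None:
            return label
    raise ValueError("last stage must always return a label")


STAGES = [related_stage, stance_stage, polarity_stage]
claim = "vaccine causes illness"
for body in [
    "weather forecast for tomorrow",
    "officials talk about vaccine supply",
    "the vaccine claim is a hoax",
    "study confirmed vaccine causes illness",
]:
    print(cascade_predict((claim, body), STAGES), "<-", body)
```

Splitting the task this way lets each step be tuned separately, for example with a higher misclassification cost for the rare disagree class in the last stage.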
Wrap-up: found data
Archival/collection
▪ Easy (assuming gatekeeper’s goodwill), even over long time periods (TweetsKB: 10 years)
▪ Public APIs, screen-scraping, crawling
Analysis
▪ Heterogeneity and scale of data (e.g. tweets, query logs)
▪ Feature extraction (stances, topics, emotions, etc) across entire corpus challenging
▪ Specific research questions usually require dedicated models (no one-size-fits-all approach)
Sharing
▪ Strict constraints (legal, ethical, licensing)
▪ Scalable sharing of sensitive data still unsolved problem
Designed behavioral web data to the rescue
▪ Goal: obtain sharable and easy to interpret behavioral web
data through experimental lab studies & quasi-
experiments
▪ Typically involves:
− Artificial settings (e.g. labs)
− Simulation of real-world online scenarios
(e.g. web search)
− Usually less sensitive
− Full consent of participants about data collection &
sharing intentions
− Short time intervals
− Small-scale data (due to costly process)
Case: web search behavior (SAL = “Search As Learning”)
Research challenges at the intersection of AI/ML,
HCI & cognitive psychology
▪ Detecting coherent search missions?
▪ Detecting learning throughout search? Detecting “informational” search missions (as opposed to “transactional” or “navigational” missions)
▪ How competent is the user? Predict/understand knowledge state of users based on in-session behavior/interactions
▪ How well does a user achieve their learning goal/information need? Predict knowledge gain throughout search session
Hoppe, A., Holtz, P., Kammerer, Y., Yu, R., Dietze, S., Ewerth, R., Current Challenges for Studying Search as Learning Processes, 7th Workshop on
Learning & Education with Web Data (LILE2018), in conjunction with ACM Web Science 2018 (WebSci18), Amsterdam, NL, 27 May, 2018.
Data collection for understanding knowledge gain/state of users
Gadiraju, U., Yu, R., Dietze, S., Holtz, P., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM CHIIR 2018.
Data collection - summary
▪ Crowdsourced collection of search session data
▪ 10 search topics (e.g. “Altitude sickness”,
“Tornados”), incl. pre- and post-tests to assess
user knowledge
▪ Approx. 1000 distinct crowd workers & 100
sessions per topic
▪ Tracking of user behavior through 76 features
in 5 categories (session, query, SERP – search
engine result page, browsing, mouse traces)
Understanding knowledge gain/state of users during web search
Some results
▪ 70% of users exhibited a knowledge gain (KG)
▪ Negative relationship between KG of users and topic popularity (avg. accuracy of workers in knowledge tests) (R = -0.87)
▪ Amount of time users actively spent on web pages explains 7% of the variance in their KG
▪ Query complexity explains 25% of the variance in the KG of users
▪ Topic-dependent behavior: search behavior correlates more strongly with search topic than with KG/KS
Gadiraju, U., Yu, R., Dietze, S., Holtz, P., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM CHIIR 2018.
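The correlation and variance figures above are linked by a simple relation: for a single-predictor linear model, the share of variance explained equals the squared Pearson correlation. If the reported figures come from such models (an assumption, the slide does not say), 25% explained variance corresponds to |r| = 0.5, 7% to |r| ≈ 0.26, and R = -0.87 to roughly 76% shared variance. The sketch below computes both quantities on toy values that are not taken from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy values (illustrative only): popular topics leave less room for gain.
topic_popularity = [0.2, 0.4, 0.5, 0.7, 0.9]
knowledge_gain = [0.6, 0.5, 0.45, 0.2, 0.1]

r = pearson_r(topic_popularity, knowledge_gain)
print(f"r = {r:.2f}, variance explained = {r * r:.0%}")
```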
Predicting knowledge gain/state during web search
▪ Same session data as Gadiraju et al., 2018
▪ Stratification of users into classes: user knowledge state (KS) and knowledge gain (KG) each binned into {low, moderate, high}, with low < mean − 0.5 SD ≤ moderate ≤ mean + 0.5 SD < high
▪ Supervised multiclass classification (Naive Bayes, logistic regression, SVM, random forest, multilayer perceptron)
▪ KG prediction performance results (after 10-fold cross-validation)
▪ Considers in-session features (behavioural traces) only
Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Predicting User Knowledge Gain in Informational Search Sessions. ACM SIGIR 2018.
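The stratification into low, moderate and high classes around the mean ± 0.5 SD band can be sketched as follows. Assumptions of this sketch: knowledge gain is taken as post-test minus pre-test score, the sample standard deviation is used, and the scores are toy values.

```python
from statistics import mean, stdev


def stratify(scores, width=0.5):
    """Label each score low/moderate/high relative to mean ± width * SD."""
    mu, sd = mean(scores), stdev(scores)
    lower, upper = mu - width * sd, mu + width * sd

    def label(score):
        if score < lower:
            return "low"
        if score > upper:
            return "high"
        return "moderate"

    return [label(s) for s in scores]


# Toy pre-/post-test scores per user (illustrative only).
pre = [2, 5, 3, 6, 4, 1]
post = [3, 9, 5, 7, 8, 1]
gains = [after - before for before, after in zip(pre, post)]

classes = stratify(gains)
print(list(zip(gains, classes)))
```

The resulting class labels are what a multiclass classifier would then be trained to predict from in-session features.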
Predicting knowledge gain/state during SAL: Features
[Figure: behavioral features]
Predicting knowledge gain/state during web search
▪ Feature importance (knowledge gain prediction task)
▪ Feature importance (knowledge state prediction task)
Gaze data as additional source of behavioral data in SAL
Davari, M., Yu, R., Dietze, S., Understanding the Influence of Topic Familiarity on Search Behavior in Digital Libraries, EARS 2019 – International Workshop on ExplainAble Recommendation and Search, @ SIGIR2019, 2019.
Otto, C., Yu, R., Pardi, G., von Hoyer, J., Rokicki, M., Hoppe, A., Holtz, P., Kammerer, Y., Dietze, S., Ewerth, R., Predicting Knowledge Gain during Web Search based on Multimedia Resource Consumption, 22nd International Conference on Artificial Intelligence in Education (AIED2021), 2021
▪ Eye gaze data (word-, sentence-, or HTML structure-
level) as additional source of behavioral data
▪ Various studies in SAL context and beyond to
understand topic familiarity, knowledge &
competence or comprehension issues
▪ Usually small study sizes (e.g. 25 < N < 150)
▪ Costly but highly informative features
Facilitating SAL research through public research data
https://data.uni-hannover.de/dataset/sal-dataset
Otto, C., Rokicki, M., Pardi, G., Gritz, W., Hienert, D., Yu, R., Hoyer, J., Hoppe, A., Dietze, S., Holtz, P., Kammerer, Y., Ewerth, R., SaL-Lightning Dataset: Search and Eye Gaze Behavior, Resource Interactions and Knowledge Gain during Web Search, ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR2022).
Case: crowd worker behavior in microtask crowdsourcing
Gadiraju, U., Kawase, R., Dietze, S., Demartini, G., Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online Surveys. ACM CHI2015
Gadiraju, U., Demartini, G., Kawase, R., Dietze, S., Crowd Anatomy Beyond the Good and Bad: Behavioral Traces for Crowd Worker Modeling and Pre-
selection, Computer Supported Cooperative Work 28(5): 815-841 (2019), Springer, 2019.
“Fast Deceiver”
“Competent Worker”
▪ Context: online crowdsourcing tasks widely used to
collect data
▪ Research question: can we classify different worker types (and detect competent workers) from behavioral traces alone (mouse movements, scrolling, keystrokes etc)?
▪ Various studies in experimental conditions capturing
wide range of features in various tasks
▪ Low-level behavioural features highly informative when
predicting worker competence and output quality
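Low-level traces only become usable for classification once they are aggregated into per-task features. A minimal sketch, assuming a hypothetical event log of `(timestamp_seconds, kind, x, y)` tuples; the feature names are illustrative and are not the feature set used in the cited studies.

```python
from math import hypot


def trace_features(events):
    """Aggregate a time-ordered event log into simple behavioral features."""
    moves = [(x, y) for _, kind, x, y in events if kind == "move"]
    # Path length of the mouse cursor across consecutive move events.
    distance = sum(
        hypot(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in zip(moves, moves[1:])
    )
    duration = events[-1][0] - events[0][0] if events else 0.0
    return {
        "duration_s": duration,
        "n_keystrokes": sum(1 for e in events if e[1] == "key"),
        "n_scrolls": sum(1 for e in events if e[1] == "scroll"),
        "mouse_distance_px": distance,
        "mouse_speed_px_s": distance / duration if duration else 0.0,
    }


# Toy log of one task (illustrative only).
log = [
    (0.0, "move", 0, 0),
    (1.0, "move", 3, 4),
    (2.0, "key", 0, 0),
    (3.0, "scroll", 0, 0),
    (4.0, "move", 3, 10),
]
features = trace_features(log)
print(features)
```

Feature vectors of this kind could then be fed to a classifier that separates worker types.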
Wrap-up: found vs designed behavioral data
Collection
▪ Found data: possible as long as gatekeepers allow crawling / scraping
▪ Designed data: costly experimental data collection
Analysis
▪ Found data: large & heterogeneous data; long time intervals; no “one-size-fits-all” methods
▪ Designed data: homogeneous, small-scale data; short time intervals; limited use cases
Sharing
▪ Found data: sensitive information; ethical, legal, licensing constraints
▪ Designed data: full consent of participants; little sensitive information due to artificial tasks
Key take-aways
▪ Behavioral web data: crucial ingredient for a wide range of research across various disciplines
▪ Found data: crucial to archive to ensure long-term access; sharing is hard due to sensitive information.
▪ Designed data: collection is costly; limited scale and scope of data.
▪ Access to behavioral web data remains a challenge => ongoing & future work @ KTS/GESIS on
− infrastructures for collecting experimental data (e.g. in web search)
− infrastructures for data access (e.g. for tweet archives)
− non-sensitive data offers to enable reuse of sensitive found data (e.g. TweetsKB)
@stefandietze
https://stefandietze.net
http://gesis.org/en/kts

More Related Content

Similar to Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside

Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods: Extracting So...
Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods:  Extracting So...Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods:  Extracting So...
Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods: Extracting So...
Shalin Hai-Jew
 
Introduction MA Data, Culture and Society | University of Westminster, UK
Introduction MA Data, Culture and Society | University of Westminster, UKIntroduction MA Data, Culture and Society | University of Westminster, UK
Introduction MA Data, Culture and Society | University of Westminster, UK
slejay
 
Me and My Big Data Project
Me and My Big Data Project Me and My Big Data Project
Me and My Big Data Project
DIPRC2019
 
Knowledge Engineering, Electronic Government and the applications to Scientom...
Knowledge Engineering, Electronic Government and the applications to Scientom...Knowledge Engineering, Electronic Government and the applications to Scientom...
Knowledge Engineering, Electronic Government and the applications to Scientom...
Roberto C. S. Pacheco
 
Predicting News Popularity by Mining Online Discussions
Predicting News Popularity by Mining Online DiscussionsPredicting News Popularity by Mining Online Discussions
Predicting News Popularity by Mining Online Discussions
Symeon Papadopoulos
 
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Stefan Dietze
 
Big Data and Research Ethics
Big Data and Research EthicsBig Data and Research Ethics
Big Data and Research Ethics
Jan Schmidt
 
Univ. of AZ Global Racing Symposium 2015 - Digital Strategies
Univ. of AZ Global Racing Symposium 2015 - Digital StrategiesUniv. of AZ Global Racing Symposium 2015 - Digital Strategies
Univ. of AZ Global Racing Symposium 2015 - Digital Strategies
smfrisby
 
Big data and development
Big data and developmentBig data and development
Big data and development
Simone Sala
 
Big data presentation for University of Reykjavik, Iceland, March 22
Big data presentation for University of Reykjavik, Iceland, March 22 Big data presentation for University of Reykjavik, Iceland, March 22
Big data presentation for University of Reykjavik, Iceland, March 22
Thorhildur Jetzek, Ph.D.
 
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
Digital Methods Initiative
 
Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...
Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...
Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...
John Makridis
 
Privacy Management and the Social Web
Privacy Management and the Social Web Privacy Management and the Social Web
Privacy Management and the Social Web
Jan Schmidt
 
New Methodologies for Capturing and Working with Publicly Available Twitter Data
New Methodologies for Capturing and Working with Publicly Available Twitter DataNew Methodologies for Capturing and Working with Publicly Available Twitter Data
New Methodologies for Capturing and Working with Publicly Available Twitter Data
Axel Bruns
 
CI_for_NA
CI_for_NACI_for_NA
CI_for_NA
webuploader
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Hoang Nguyen
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Tony Nguyen
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Young Alista
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Harry Potter
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
James Wong
 

Similar to Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside (20)

Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods: Extracting So...
Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods:  Extracting So...Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods:  Extracting So...
Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods: Extracting So...
 
Introduction MA Data, Culture and Society | University of Westminster, UK
Introduction MA Data, Culture and Society | University of Westminster, UKIntroduction MA Data, Culture and Society | University of Westminster, UK
Introduction MA Data, Culture and Society | University of Westminster, UK
 
Me and My Big Data Project
Me and My Big Data Project Me and My Big Data Project
Me and My Big Data Project
 
Knowledge Engineering, Electronic Government and the applications to Scientom...
Knowledge Engineering, Electronic Government and the applications to Scientom...Knowledge Engineering, Electronic Government and the applications to Scientom...
Knowledge Engineering, Electronic Government and the applications to Scientom...
 
Predicting News Popularity by Mining Online Discussions
Predicting News Popularity by Mining Online DiscussionsPredicting News Popularity by Mining Online Discussions
Predicting News Popularity by Mining Online Discussions
 
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
 
Big Data and Research Ethics
Big Data and Research EthicsBig Data and Research Ethics
Big Data and Research Ethics
 
Univ. of AZ Global Racing Symposium 2015 - Digital Strategies
Univ. of AZ Global Racing Symposium 2015 - Digital StrategiesUniv. of AZ Global Racing Symposium 2015 - Digital Strategies
Univ. of AZ Global Racing Symposium 2015 - Digital Strategies
 
Big data and development
Big data and developmentBig data and development
Big data and development
 
Big data presentation for University of Reykjavik, Iceland, March 22
Big data presentation for University of Reykjavik, Iceland, March 22 Big data presentation for University of Reykjavik, Iceland, March 22
Big data presentation for University of Reykjavik, Iceland, March 22
 
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
 
Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...
Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...
Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...
 
Privacy Management and the Social Web
Privacy Management and the Social Web Privacy Management and the Social Web
Privacy Management and the Social Web
 
New Methodologies for Capturing and Working with Publicly Available Twitter Data
New Methodologies for Capturing and Working with Publicly Available Twitter DataNew Methodologies for Capturing and Working with Publicly Available Twitter Data
New Methodologies for Capturing and Working with Publicly Available Twitter Data
 
CI_for_NA
CI_for_NACI_for_NA
CI_for_NA
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 

More from Stefan Dietze

Understanding Scientific and Societal Adoption and Impact of Science Through ...
Understanding Scientific and Societal Adoption and Impact of Science Through ...Understanding Scientific and Societal Adoption and Impact of Science Through ...
Understanding Scientific and Societal Adoption and Impact of Science Through ...
Stefan Dietze
 
NEWORDER Project - Science in the online knowledge order
NEWORDER Project - Science in the online knowledge orderNEWORDER Project - Science in the online knowledge order
NEWORDER Project - Science in the online knowledge order
Stefan Dietze
 
An interdisciplinary journey with the SAL spaceship – results and challenges ...
An interdisciplinary journey with the SAL spaceship – results and challenges ...An interdisciplinary journey with the SAL spaceship – results and challenges ...
An interdisciplinary journey with the SAL spaceship – results and challenges ...
Stefan Dietze
 
Research Knowledge Graphs at NFDI4DS & GESIS
Research Knowledge Graphs at NFDI4DS & GESISResearch Knowledge Graphs at NFDI4DS & GESIS
Research Knowledge Graphs at NFDI4DS & GESIS
Stefan Dietze
 
Research Knowledge Graphs at GESIS & NFDI4DataScience
Research Knowledge Graphs at GESIS & NFDI4DataScienceResearch Knowledge Graphs at GESIS & NFDI4DataScience
Research Knowledge Graphs at GESIS & NFDI4DataScience
Stefan Dietze
 
Human-in-the-Loop: das Web als Grundlage interdisziplinärer Data Science Meth...
Human-in-the-Loop: das Web als Grundlage interdisziplinärer Data Science Meth...Human-in-the-Loop: das Web als Grundlage interdisziplinärer Data Science Meth...
Human-in-the-Loop: das Web als Grundlage interdisziplinärer Data Science Meth...
Stefan Dietze
 
Towards research data knowledge graphs
Towards research data knowledge graphsTowards research data knowledge graphs
Towards research data knowledge graphs
Stefan Dietze
 
Beyond research data infrastructures: exploiting artificial & crowd intellige...
Beyond research data infrastructures: exploiting artificial & crowd intellige...Beyond research data infrastructures: exploiting artificial & crowd intellige...
Beyond research data infrastructures: exploiting artificial & crowd intellige...
Stefan Dietze
 
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
Stefan Dietze
 
Using AI to understand everyday learning on the Web
Using AI to understand everyday learning on the WebUsing AI to understand everyday learning on the Web
Using AI to understand everyday learning on the Web
Stefan Dietze
 
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside

  • 1. Collecting and Temporal Analysis of Behavioral Web Data - Tales from the Inside TempWeb2024, 13 May 2024 Stefan Dietze GESIS, HHU & HeiCAD Düsseldorf
  • 2. What is behavioral web data? Source: Domo via PCMag
  • 3. What is behavioral web data? ▪ Social web activity streams (posts, shares, likes, follows etc) ▪ Web search behaviour & SERP (Search Engine Result Pages) interactions ▪ Browsing and navigation behaviour ▪ Low-level behavioral traces (scrolling, mouse movements, gaze behavior etc) ▪ Hard to separate from actual Web content/pages ▪ But: closer to users & their personal (potentially sensitive) information
  • 4. Why is it important? ▪ Reflects attitudes, leanings, cognitive states, biases ▪ Without understanding behavior, we cannot understand content / data it produces ▪ Majority of algorithms and models rely on behavioral data (e.g. clickthrough data for ranking algorithms)… ▪ …or are substantially impacted by user behavior (e.g. LLMs trained on user-generated content that in turn is driven by user interactions) ▪ Central to various research fields in CS concerned with information behavior: interactive information retrieval, HCI, user modeling, Web mining, etc
  • 5. Why is it important? ▪ Spawned entirely new research areas like Computational Social Science (CSS)
  • 6. Overview ▪ Challenges of behavioral web data ▪ Case studies (collecting, sharing, analysis: data & methods) o "Found" behavioral web data o "Designed" behavioral web data ▪ Take-aways & outlook
  • 7. Challenges: dependencies on 3rd party gatekeepers Behavioral data is usually tied to specific platforms rather than distributed like the WWW
  • 8. Challenges: volatility & decay of data • Data is not persistent • Example: deletion ratio of tweets between 25% and 29% • Ratio differs between samples
  • 9. Challenges: volatility & decay of data
  • 10. Challenges: behavioral data is sensitive Example: AOL release of 20 M search queries (2006)
  • 11. Challenges: legal restrictions and ethical concerns ▪ Behavioral web data tends to involve sensitive information ▪ Ethical concerns, e.g., when information is taken out of context ▪ Various national and international laws (GDPR etc) ▪ Licensing / legal aspects: Twitter terms of service, copyright, etc. ▪ At the same time: right to archive / research wired into various national legislations ▪ Different constraints for (a) archiving and (b) sharing / using data as well as for different uses & users (e.g. archival institutions) ▪ Individual risk assessment per use case: What (kind of data)?, For what purpose? By whom?
  • 12. Overview ▪ Challenges of behavioral web data ▪ Case studies (collecting, sharing, analysis: data and methods) o "Found" behavioral web data o "Designed" behavioral web data ▪ Take-aways & outlook
  • 13. Found & designed web data for investigating (mis)information behavior Range of research concerned with IR & CSS: ▪ Insights, e.g.: − Understanding information interaction (e.g. during search) − Spreading of claims and misinformation − Effect of biased news/claims on public opinion ▪ Computational Methods, e.g.: − Crawling, harvesting, scraping of data − Information retrieval & ranking − Extraction of structured knowledge (entities, sentiments, stances, claims, etc) − Classification of search/navigation behavior or users http://gesis.org/en/kts
  • 14. Found behavioral web data ▪ Data that can be harvested via open APIs or scraped from the public web over long time periods and that captures real-world online interactions "found" in the wild ▪ Examples: social web posts/interactions, Twitter/X data (specifically before API shutdown) ▪ Tends to include data that has been shared voluntarily by online users, e.g. Twitter users ▪ But: users usually did not provide explicit consent for secondary use of their data
  • 15. Case study: Twitter/X Motivation Archival perspective: ▪ Ensure long-term archival of volatile information from Twitter ▪ Independence from third-party data access / APIs Research perspective ▪ Training and evaluating machine learning models (e.g., NER, classification) ▪ Large-scale analyses (e.g., language use, trends) ▪ Facilitate interdisciplinary research on societal online discourse (e.g. political science, communication science, psychology, sociology) → Goal: capture a representative sample of all Twitter data
  • 16. Why real-time collection & preservation of Twitter/X data? Data decay ▪ Approx. 28% of tweets deleted over time ▪ Power law distribution: the vast majority of deleted tweets are deleted by a small number of users ▪ Prevalent biases in deleted/non-deleted data: anti-science, conservative and hard-line views more frequent in deleted tweets
  • 17. Why real-time collection & preservation of Twitter/X data? Model decay due to evolving language & vocabulary ▪ Models & LLMs trained on large volumes of text ▪ Yet: strong vocabulary shift, over-/underrepresentation of topics/vocabulary in particular time periods (e.g. Twitter COVID-19 discourse 2020 vs 2019) ▪ LLMs for online discourse analysis require frequent training and updates (and continuous access to data) Source: Hombaiah et al., "Dynamic Language Models for continuously evolving Content", SIGKDD2021
  • 18. Redundant crawls of 1% Twitter stream via Firehose API ▪ 14 billion tweets collected between 04/2013 – 05/2023 ▪ Largest continuous tweet archive for research purposes ▪ Legal, ethical and licensing constraints (Twitter ToS) ▪ Data sharing via: o Sensitive data access: facilitating on-prem research on data (e.g. online/offline secure data centers) or contract-based sharing of sensitive data o Public, non-sensitive data offers: creating non-sensitive derivatives from raw data to facilitate research
  • 19. TweetsKB – a non-sensitive large-scale archive of societal discourse ▪ Subset of 3 billion prefiltered tweets (English, spam detection through pretrained classifier) ▪ Sharing of tweet metadata (time stamps, retweet counts etc), hash tags, user mentions and dedicated features that capture tweet semantics (no actual user IDs and full texts) ▪ Features include [CIKM2020, CIKM2022]: o Disambiguated mentions of entities, linked to Wikipedia/DBpedia ("president"/"potus"/"trump" => dbp:DonaldTrump) o Sentiment scores (positive/negative emotions) o Geotags via pretrained DeepGeo model o Science references/claims [CIKM2022] ▪ Feature statistics (feature: total / unique / % with >= 1 feature): o Hashtags: 1,161,839,471 / 68,832,205 / 0.19 o Mentions: 1,840,456,543 / 149,277,474 / 0.38 o Entities: 2,563,433,997 / 2,265,201 / 0.56 o Sentiment: 1,265,974,641 / - / 0.5 https://data.gesis.org/tweetskb Dimitrov, D., Fafalios, P., Yu, R., Zhu, X., Zloch, M., Dietze, S., TweetsCOV19 – A KB of Semantically Annotated Tweets about the COVID-19 Pandemic, CIKM2020 Hafid, S., Schellhammer, S., Bringay, S., Todorov, K., Dietze, S., SciTweets - A Dataset and Annotation Framework for Detecting Scientific Online Discourse, CIKM2022
  • 20. TweetsKB – knowledge graph schema & data access Data access via: ▪ SPARQL endpoint/REST API for demos ▪ Download of data dumps (Zenodo, SDN Datorium) ▪ So far approx. 30 K downloads https://data.gesis.org/tweetskb Dimitrov, D., Fafalios, P., Yu, R., Zhu, X., Zloch, M., Dietze, S., TweetsCOV19 – A KB of Semantically Annotated Tweets about the COVID-19 Pandemic, CIKM2020
  • 21. TweetsKB as social science research corpus: investigating vaccine hesitancy in DACH countries ▪ Twitter discourse on "Impfbereitschaft" / "vaccination hesitancy" ▪ Event marked in the figure: Germany suspends vaccinations with Astra Zeneca https://dd4p.gesis.org/ Boland, K. et al., Data for policy-making in times of crisis - a computational analysis of German online discourses about COVID-19 vaccinations, JMIR, under review
  • 22. Case: Telegram ▪ Telegram channels: public, only admin can post (as opposed to private groups) ▪ Decentralised: no registry of channels available ▪ Continuous data collection of currently 400 K channels through snowball sampling (300 seed channels) ▪ Full message history collected for > 10 K channels, approx. 100 M messages so far ▪ Telegram cross-channel message passing dataset extracted to support research on information spreading, e.g. mis- and disinformation, hate speech etc
  • 23. Understanding claims & misinformation on the Web: ClaimsKG Motivation ▪ Claims spread across various (unstructured) fact-checking sites ▪ Claims and truth ratings evolve over time ▪ Finding claims is hard: e.g. claims about / made by US republican politicians across the Web? Approach ▪ Continuous harvesting of claims & metadata from fact-checking sites (e.g. snopes.com, Politifact.com etc); currently approx. 75,000 claims since 2019 ▪ Feature extraction & linking: o Mentioned entities o Joint topic classification o Normalisation of ratings (true, false, mixture, other); coreference resolution of claims o Exposing data through established vocabulary and W3C standards (e.g. SPARQL endpoint) https://data.gesis.org/claimskg/ A. Tchechmedjiev, P. Fafalios, K. Boland, S. Dietze, B. Zapilko, K. Todorov, ClaimsKG – A Live Knowledge Graph of fact-checked Claims, ISWC2019
  • 24. Evolution of claims: frequency & topics https://data.gesis.org/claimskg/ S. Gangopadhyay et al., Investigating Characteristics, Biases and Evolution of Fact-Checked Claims on the Web, ACM Hypertext 2024 (under review)
  • 25. Evolution of claims: topic biases of fact-check sources https://data.gesis.org/claimskg/ S. Gangopadhyay et al., Investigating Characteristics, Biases and Evolution of Fact-Checked Claims on the Web, ACM Hypertext 2024 (under review)
  • 26. Stances towards claims / fake news in social media Motivation ▪ Problem: detecting stance of documents (e.g. social media posts) towards a given claim (unbalanced class distribution) ▪ Motivation: stance of documents (in particular disagreement) useful (a) as signal for truthfulness (fake news detection) and (b) for document or source classification (e.g. users) Approach ▪ Cascading binary classifiers: addressing individual issues (e.g. misclassification costs) per step ▪ Features, e.g. textual similarity (Word2Vec etc), sentiments, LIWC, etc. ▪ Best-performing models: 1) SVM with class-wise penalty, 2) CNN, 3) SVM with class-wise penalty ▪ Experiments on FNC-1 dataset (and FNC baselines) Results ▪ Minor overall performance improvement ▪ Improvement on disagree class by 27% (but still far from robust) A., Fafalios, P., Ekbal, A., Zhu, X., Dietze, S., Exploiting stance hierarchies for cost-sensitive stance detection of Web documents, J Intell. Inf. Syst. 58(1), 1-19 (2022)
  • 27. Wrap-up: found data Archival/collection ▪ Easy (assuming gatekeeper's goodwill), even over long time periods (TweetsKB: 10 years) ▪ Public APIs, screen-scraping, crawling Analysis ▪ Heterogeneity and scale of data (example: tweets, query logs) ▪ Feature extraction (stances, topics, emotions, etc) across entire corpus challenging ▪ Specific research questions usually require dedicated models (no one-size-fits-all approach) Sharing ▪ Strict constraints (legal, ethical, licensing) ▪ Scalable sharing of sensitive data is still an unsolved problem
  • 28. Designed behavioral web data to the rescue ▪ Goal: obtain sharable and easy to interpret behavioral web data through experimental lab studies & quasi-experiments ▪ Typically involves: − Artificial settings (e.g. labs) − Simulation of real-world online scenarios (e.g. web search) − Usually less sensitive − Full consent of participants about data collection & sharing intentions − Short time intervals − Small-scale data (due to costly process)
  • 29. Case: web search behavior (SAL = "Search As Learning") Research challenges at the intersection of AI/ML, HCI & cognitive psychology ▪ Detecting coherent search missions? ▪ Detecting learning throughout search? Detecting "informational" search missions (as opposed to "transactional" or "navigational" missions) ▪ How competent is the user? – Predict/understand knowledge state of users based on in-session behavior/interactions ▪ How well does a user achieve his/her learning goal/information need? - Predict knowledge gain throughout search session Hoppe, A., Holtz, P., Kammerer, Y., Yu, R., Dietze, S., Ewerth, R., Current Challenges for Studying Search as Learning Processes, 7th Workshop on Learning & Education with Web Data (LILE2018), in conjunction with ACM Web Science 2018 (WebSci18), Amsterdam, NL, 27 May, 2018.
  • 30. Data collection for understanding knowledge gain/state of users Data collection - summary ▪ Crowdsourced collection of search session data ▪ 10 search topics (e.g. "Altitude sickness", "Tornados"), incl. pre- and post-tests to assess user knowledge ▪ Approx. 1000 distinct crowd workers & 100 sessions per topic ▪ Tracking of user behavior through 76 features in 5 categories (session, query, SERP – search engine result page, browsing, mouse traces) Gadiraju, U., Yu, R., Dietze, S., Holtz, P., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM CHIIR 2018.
  • 31. Understanding knowledge gain/state of users during web search Some results ▪ 70% of users exhibited a knowledge gain (KG) ▪ Negative relationship between KG of users and topic popularity (avg. accuracy of workers in knowledge tests) (R = -.87) ▪ Amount of time users actively spent on web pages explains 7% of the variance in their KG ▪ Query complexity explains 25% of the variance in the KG of users ▪ Topic-dependent behavior: search behavior correlates more strongly with search topic than with KG/KS Gadiraju, U., Yu, R., Dietze, S., Holtz, P., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM CHIIR 2018.
  • 32. Predicting knowledge gain/state during web search ▪ Same session data as Gadiraju et al., 2018 ▪ Stratification of users into classes: user knowledge state (KS) and knowledge gain (KG) into {low, moderate, high} using (low < (mean ± 0.5 SD) < high) ▪ Supervised multiclass classification (Naive Bayes, Logistic regression, SVM, random forest, multilayer perceptron) ▪ KG prediction performance results (after 10-fold cross-validation) ▪ Considers in-session features (behavioural traces) only Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Predicting User Knowledge Gain in Informational Search Sessions. ACM SIGIR 2018.
  • 33. Predicting knowledge gain/state during SAL: Features ▪ Behavioral features Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Predicting User Knowledge Gain in Informational Search Sessions. ACM SIGIR 2018.
  • 34. Predicting knowledge gain/state during web search ▪ Feature importance (knowledge gain prediction task) Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Predicting User Knowledge Gain in Informational Search Sessions. ACM SIGIR 2018.
  • 35. Predicting knowledge gain/state during web search ▪ Feature importance (knowledge state prediction task) Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Predicting User Knowledge Gain in Informational Search Sessions. ACM SIGIR 2018.
  • 36. Gaze data as additional source of behavioral data in SAL 42 Davari, M., Yu., R., Dietze, S., Understanding the Influence of Topic Familiarity on Search Behavior in Digital Libraries, EARS 2019 – International Workshop on ExplainAble Recommendation and Search, @ SIGIR2019, 2019. Otto, C., Yu, R., Pardi, G., von Hoyer, J., Rokicki, M., Hoppe, A., Holtz, P., Kammerer, Y., Dietze, S., Ewerth, E., Predicting Knowledge Gain during Web Search based on Multimedia Resource Consumption, 22nd International Conference on Artificial Intelligence in Education (AIED2021), 2021 ▪ Eye gaze data (word-, sentence-, or HTML structure- level) as additional source of behavioral data ▪ Various studies in SAL context and beyond to understand topic familiarity, knowledge & competence or comprehension issues ▪ Usually small study sizes (e.g. 25 < N < 150) ▪ Costly but highly informative features
  • 37. Facilitating SAL research through public research data 43 https://data.uni-hannover.de/dataset/sal-dataset Otto, C., Rokicki, M., Pardi, G., Gritz, W., Hienert, D.,Yu, R., Hoyer, J., Hoppe, A., Dietze, S., Holtz, P., Kammerer, Y., Ewerth, R., SaL-Lightning Dataset: Search and Eye Gaze Behavior, Resource Interactions and Knowledge Gain during Web Search, ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR2022).
  • 38. Case: crowd worker behavior in microtask crowdsourcing ▪ Context: online crowdsourcing tasks are widely used to collect data ▪ Research question: can we classify different worker types (and detect competent workers) from behavioral traces alone (mouse movements, scrolling, keystrokes etc)? ▪ Various studies under experimental conditions, capturing a wide range of features across various tasks ▪ Low-level behavioral features are highly informative when predicting worker competence and output quality ▪ Example worker types: „Fast Deceiver“, „Competent Worker“ Gadiraju, U., Kawase, R., Dietze, S., Demartini, G., Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online Surveys. ACM CHI2015. Gadiraju, U., Demartini, G., Kawase, R., Dietze, S., Crowd Anatomy Beyond the Good and Bad: Behavioral Traces for Crowd Worker Modeling and Pre-selection, Computer Supported Cooperative Work 28(5): 815-841 (2019), Springer, 2019.
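To make "low-level behavioral features" concrete, here is a minimal sketch of how a raw interaction trace might be aggregated into per-session features before being passed to a classifier. The `Event` schema, the `trace_features` helper, and the feature names are illustrative assumptions and are not taken from the cited studies.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Event:
    """One logged interaction (hypothetical schema, for illustration only)."""
    t: float        # seconds since task start
    kind: str       # "mouse", "scroll", or "key"
    x: float = 0.0  # cursor position, only meaningful for "mouse" events
    y: float = 0.0


def trace_features(events):
    """Aggregate a raw event trace into a few simple per-session features."""
    events = sorted(events, key=lambda e: e.t)
    mouse = [e for e in events if e.kind == "mouse"]
    # Cursor path length: summed distance between consecutive mouse samples.
    path = sum(hypot(b.x - a.x, b.y - a.y) for a, b in zip(mouse, mouse[1:]))
    duration = events[-1].t - events[0].t if events else 0.0
    return {
        "duration_s": duration,
        "mouse_path_px": path,
        "n_scrolls": sum(e.kind == "scroll" for e in events),
        "n_keystrokes": sum(e.kind == "key" for e in events),
        "events_per_s": len(events) / duration if duration > 0 else 0.0,
    }
```

A feature dictionary of this kind, computed per worker and task, could then serve as input to any standard classifier; which features and models the cited studies actually used is described in the papers themselves.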
  • 39. Wrap-up: found vs designed behavioral data
    FOUND DATA ▪ Collection: possible only as long as gatekeepers allow crawling / scraping ▪ Analysis: large & heterogeneous data; long time intervals; no „one-size-fits-all“ methods ▪ Sharing: sensitive information; ethical, legal, licensing constraints
    DESIGNED DATA ▪ Collection: costly experimental data collection ▪ Analysis: homogeneous, small-scale data; short time intervals; limited use cases ▪ Sharing: full consent of participants; little sensitive information due to artificial tasks
  • 40. Key take-aways ▪ Behavioral Web Data: a crucial ingredient for a wide range of research across various disciplines ▪ Found Data: crucial to archive to ensure long-term access; sharing is hard due to sensitive information. ▪ Designed Data: collection is costly; limited scale and scope of data. ▪ Access to behavioral web data remains a challenge => ongoing & future work @ KTS/GESIS on − infrastructures for collecting experimental data (e.g. in web search) − infrastructures for data access (e.g. for tweet archives) − non-sensitive data offers to enable reuse of sensitive found data (e.g. TweetsKB)