Collecting and Temporal Analysis of Behavioral Web
Data - Tales from the Inside
TempWeb2024, 13 May 2024
Stefan Dietze
GESIS, HHU & HeiCAD Düsseldorf
What is behavioral web data?
[Figure; source: Domo via PCMag]
▪ Social web activity streams (posts, shares, likes, follows etc)
▪ Web search behaviour & SERP (Search Engine Result Pages) interactions
▪ Browsing and navigation behaviour
▪ Low-level behavioral traces (scrolling, mouse movements, gaze behavior etc)
▪ Hard to separate from actual Web content/pages
▪ But: closer to users & their personal (potentially sensitive) information
Why is it important?
▪ Reflects attitudes, leanings, cognitive states, biases
▪ Without understanding behavior, we cannot understand the content / data it produces
▪ Majority of algorithms and models rely on behavioral data (e.g. clickthrough data for
ranking algorithms)…
▪ …or are substantially impacted by user behavior (e.g. LLMs trained on user-generated
content that in turn is driven by user interactions)
▪ Central to various research fields in CS concerned with information behavior:
interactive information retrieval, HCI, user modeling, Web mining, etc
▪ Spawned entirely new research areas like Computational Social Science (CSS)
Overview
▪ Challenges of behavioral web data
▪ Case studies (collecting, sharing, analysis: data & methods)
o “Found” behavioral web data
o “Designed” behavioral web data
▪ Take-aways & outlook
Challenges: dependencies on 3rd party gatekeepers
Behavioral data is usually tied to specific platforms rather than distributed like the WWW
Challenges: volatility & decay of data
• Data is not persistent
• Example: deletion ratio of tweets between 25% and 29%
• Differs between different samples
Challenges: behavioral data is sensitive
Example: AOL release of 20 M search queries (2006)
Challenges: legal restrictions and ethical concerns
▪ Behavioral web data tends to involve sensitive information
▪ Ethical concerns, e.g., when information is taken out of context
▪ Various national and international laws (GDPR etc)
▪ Licensing / legal aspects: Twitter terms of service, copyright, etc.
▪ At the same time: right to archive / research wired into various national legislations
▪ Different constraints for (a) archiving and (b) sharing / using data as well as for
different uses & users (e.g. archival institutions)
▪ Individual risk assessment per use case: What (kind of data)? For what purpose? By whom?
Found & designed web data for investigating (mis)information behavior
Range of research concerned with IR & CSS:
▪ Insights, e.g.:
− Understanding information interaction (e.g. during search)
− Spreading of claims and misinformation
− Effect of biased news/claims on public opinion
▪ Computational methods, e.g.:
− Crawling, harvesting, scraping of data
− Information retrieval & ranking
− Extraction of structured knowledge (entities, sentiments, stances, claims, etc)
− Classification of search/navigation behavior or users
http://gesis.org/en/kts
Found behavioral web data
▪ Data that can be harvested via open APIs or
scraped from the public web over long time
periods and captures real-world online
interactions “found” in the wild
▪ Examples: social web posts/interactions, Twitter/X data (specifically before API shutdown)
▪ Tends to include data that has been shared
voluntarily by online users, e.g. Twitter users
▪ But: users usually did not provide explicit consent
for secondary use of their data
Case study: Twitter/X
Motivation
Archival perspective:
▪ Ensure long-term archival of volatile information from Twitter
▪ Independence from third-party data access / APIs
Research perspective:
▪ Training and evaluating machine learning models (e.g., NER, classification)
▪ Large-scale analyses (e.g., language use, trends)
▪ Facilitate interdisciplinary research on societal online discourse
(e.g. political science, communication science, psychology, sociology)
→ Goal: capture a representative sample of all Twitter data
Why real-time collection & preservation of Twitter/X data?
Data decay
▪ Approx. 28% of tweets are deleted over time
▪ Power-law distribution: the vast majority of deletions come from a small number of users
▪ Prevalent biases in deleted vs. non-deleted data: anti-science, conservative and hard-line views are more frequent in deleted tweets
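The two decay measures on this slide (overall deletion ratio, concentration of deletions among few users) can be sketched as follows. This is a minimal illustration, assuming a hypothetical list of `(user_id, deleted)` records obtained by re-checking archived tweets; the record layout and function names are not taken from the talk.

```python
from collections import Counter

def deletion_ratio(records):
    """Fraction of archived tweets that are no longer available."""
    if not records:
        return 0.0
    return sum(1 for _, deleted in records if deleted) / len(records)

def top_user_share(records, top_fraction=0.1):
    """Share of all deletions caused by the top `top_fraction` of deleting users.

    Only users with at least one deleted tweet are ranked; at least one
    user is always included in the "top" group.
    """
    per_user = Counter(user for user, deleted in records if deleted)
    if not per_user:
        return 0.0
    counts = sorted(per_user.values(), reverse=True)
    k = max(1, int(len(counts) * top_fraction))
    return sum(counts[:k]) / sum(counts)

# Toy data (hypothetical): user "a" deletes heavily, "b" and "c" once each.
records = [("a", True)] * 8 + [("b", True), ("c", True)] + [("d", False)] * 30

print(deletion_ratio(records))   # overall deletion ratio
print(top_user_share(records))   # concentration of deletions
```

A high `top_user_share` combined with a moderate `deletion_ratio` is the pattern the slide describes as a power-law distribution.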
Model decay due to evolving language & vocabulary
▪ Models & LLMs trained on large volumes of text
▪ Yet: strong vocabulary shift, over-/underrepresentation of topics/vocabulary in particular time periods (e.g. Twitter COVID-19 discourse in 2020 vs 2019)
▪ LLMs for online discourse analysis require
frequent training and updates (and continuous
access to data)
Source: Hombaiah et al., “Dynamic Language Models for continuously evolving Content”, SIGKDD2021
Redundant crawls of 1% Twitter stream via Firehose API
▪ 14 billion tweets collected between 04/2013 – 05/2023
▪ Largest continuous tweet archive for research purposes
▪ Legal, ethical and licensing constraints (Twitter ToS)
▪ Data sharing via:
o Sensitive data access: facilitating on-prem research on data (e.g. online/offline
secure data centers) or contract-based sharing of sensitive data
o Public, non-sensitive data offers: creating non-sensitive derivatives from raw data to
facilitate research
TweetsKB – a non-sensitive large-scale archive of societal discourse
▪ Subset of 3 billion prefiltered tweets
(English, spam detection through pretrained classifier)
▪ Sharing of tweet metadata (time stamps, retweet counts etc), hashtags, user mentions and dedicated features that capture tweet semantics (no actual user IDs or full texts)
▪ Features include [CIKM2020, CIKM2022]:
o Disambiguated mentions of entities, linked to Wikipedia/DBpedia (“president”/“potus”/“trump” => dbp:DonaldTrump)
o Sentiment scores (positive/negative emotions)
o Geotags via pretrained DeepGeo model
o Science references/claims [CIKM2022]
https://data.gesis.org/tweetskb
Feature     Total           Unique        Share of tweets with >= 1 feature
Hashtags    1,161,839,471   68,832,205    0.19
Mentions    1,840,456,543   149,277,474   0.38
Entities    2,563,433,997   2,265,201     0.56
Sentiment   1,265,974,641   -             0.5
Dimitrov, D., Fafalios, P., Yu, R., Zhu, X., Zloch, M., Dietze, S., TweetsCOV19 – A KB of Semantically Annotated Tweets about the COVID-19 Pandemic, CIKM2020
Hafid, S., Schellhammer, S., Bringay, S., Todorov, K., Dietze, S., SciTweets - A Dataset and Annotation Framework for Detecting Scientific Online Discourse, CIKM2022
TweetsKB – knowledge graph schema & data access
Data access via:
▪ SPARQL endpoint/REST API for demos
▪ Download of data dumps (Zenodo, SDN Datorium)
▪ So far approx. 30 K downloads
TweetsKB as social science research corpus
Investigating vaccine hesitancy in DACH countries
Twitter discourse on “Impfbereitschaft” / “vaccination hesitancy”
[Figure annotation: Germany suspends vaccinations with AstraZeneca]
https://dd4p.gesis.org/
Boland, K. et al., Data for policy-making in times of crisis - a computational analysis of German online discourses about COVID-19 vaccinations, JMIR, under review
Case: Telegram
▪ Telegram channels: public, only admin can post (as opposed to
private groups)
▪ Decentralised: no registry of channels available
▪ Continuous data collection of currently 400 K channels through
snowball sampling (300 seed channels)
▪ Full message history collected for > 10 K channels, approx. 100 M
messages so far
▪ Telegram cross-channel message passing dataset extracted to
support information spreading research, i.e., mis- and
disinformation, hate speech etc
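Because no channel registry exists, snowball sampling discovers channels by following references from already known ones. The sketch below shows the idea as a breadth-first expansion from seed channels. `fetch_referenced_channels` is a hypothetical callable standing in for whatever extracts forwarded-from or mentioned channels; no real Telegram API is used, and the toy graph is invented for illustration.

```python
from collections import deque

def snowball(seeds, fetch_referenced_channels, max_channels=1000):
    """Breadth-first channel discovery starting from seed channels.

    `fetch_referenced_channels(channel)` (hypothetical) returns the channels
    referenced by `channel`, e.g. via forwarded messages or mentions.
    Returns channels in the order they were visited.
    """
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen and len(seen) < max_channels:
            seen.add(seed)
            queue.append(seed)

    visited = []
    while queue:
        channel = queue.popleft()
        visited.append(channel)
        for ref in fetch_referenced_channels(channel):
            # Stop admitting new channels once the budget is exhausted.
            if ref not in seen and len(seen) < max_channels:
                seen.add(ref)
                queue.append(ref)
    return visited

# Toy reference graph (hypothetical channel names).
graph = {"seed1": ["a", "b"], "a": ["b", "c"], "b": [], "c": ["seed1"]}
print(snowball(["seed1"], lambda ch: graph.get(ch, [])))
```

The `seen` set prevents revisiting channels when references form cycles, which is common in cross-channel forwarding.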
Understanding claims & misinformation on the Web: ClaimsKG
Motivation
▪ Claims spread across various (unstructured) fact-checking
sites
▪ Claims and truth ratings evolve over time
▪ Finding claims is hard: e.g. claims about / made by US Republican politicians across the Web?
Approach
▪ Continuous harvesting of claims & metadata from fact-checking sites (e.g. snopes.com, Politifact.com etc); currently approx. 75,000 claims since 2019
▪ Feature extraction & linking:
o Mentioned entities
o Joint topic classification
o Normalisation of ratings (true, false, mixture, other);
coreference resolution of claims
o Exposing data through established vocabulary and W3C
standards
(e.g. SPARQL endpoint)
https://data.gesis.org/claimskg/
A. Tchechmedjiev, P. Fafalios, K. Boland, S. Dietze, B. Zapilko, K. Todorov, ClaimsKG – A Live Knowledge Graph of fact-checked Claims, ISWC2019
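Normalisation of ratings means mapping each site's own verdict labels onto the four shared classes named above (true, false, mixture, other). A minimal sketch of such a mapping follows. The label-to-class table is an illustrative assumption and may differ from the mapping ClaimsKG actually uses.

```python
# Illustrative mapping only (assumption): the real ClaimsKG mapping may differ.
RATING_MAP = {
    "true": "TRUE",
    "false": "FALSE",
    "pants on fire": "FALSE",
    "half true": "MIXTURE",
    "mixture": "MIXTURE",
    "mostly true": "MIXTURE",
    "mostly false": "MIXTURE",
}

def normalise_rating(raw):
    """Map a site-specific rating label to TRUE / FALSE / MIXTURE / OTHER."""
    # Lower-case, treat hyphens as spaces, drop "!" and collapse whitespace.
    key = " ".join(raw.lower().replace("-", " ").replace("!", "").split())
    # Anything not covered by the table falls into the catch-all class.
    return RATING_MAP.get(key, "OTHER")

for label in ["Pants on Fire!", "Half-True", "Unproven", "  TRUE "]:
    print(label, "->", normalise_rating(label))
```

Keeping an explicit catch-all class avoids silently forcing site-specific verdicts (e.g. "unproven") into a truth value.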
Evolution of claims
[Figure: frequency & topics of claims over time]
[Figure: topic biases of fact-check sources]
https://data.gesis.org/claimskg/
S. Gangopadhay et al., Investigating Characteristics, Biases and Evolution of Fact-Checked Claims on the Web, ACM Hypertext 2024 (under review)
Stances towards claims / fake news in social media
Motivation
▪ Problem: detecting stance of documents (e.g. social media posts)
towards a given claim (unbalanced class distribution)
▪ Motivation: stance of documents (in particular disagreement) is useful (a) as a signal for truthfulness (fake news detection) and (b) for document or source classification (e.g. users)
Approach
▪ Cascading binary classifiers: addressing individual issues (e.g.
misclassification costs) per step
▪ Features, e.g. textual similarity (Word2Vec etc), sentiments, LIWC,
etc.
▪ Best-performing models per step: 1) SVM with class-wise penalty, 2) CNN, 3) SVM with class-wise penalty
▪ Experiments on FNC-1 dataset (and FNC baselines)
Results
▪ Minor overall performance improvement
▪ Improvement on disagree class by 27%
(but still far from robust)
A., Fafalios, P., Ekbal, A., Zhu, X., Dietze, S., Exploiting stance hierarchies for cost-sensitive stance detection of Web documents, J Intell. Inf. Syst. 58(1), 1-19 (2022)
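A cascade of binary classifiers can be sketched as a chain of yes/no decisions that ends in one of the four FNC-1 labels (unrelated, discuss, agree, disagree). The step order below (relatedness, then neutral vs. stance-taking, then agree vs. disagree) is an assumption for illustration. The three classifiers are passed in as callables, and the keyword rules are toy stand-ins, not the SVM/CNN models from the paper.

```python
def cascade_stance(doc, claim, is_related, takes_stance, agrees):
    """Three-step cascade; each step is a binary decision (assumed order)."""
    if not is_related(doc, claim):
        return "unrelated"
    if not takes_stance(doc, claim):
        return "discuss"
    return "agree" if agrees(doc, claim) else "disagree"

# Toy stand-in classifiers (hypothetical keyword rules, for illustration only).
def is_related(doc, claim):
    claim_words = {w for w in claim.lower().split() if len(w) > 3}
    return bool(claim_words & set(doc.lower().split()))

def takes_stance(doc, claim):
    return any(w in doc.lower().split() for w in ("true", "false"))

def agrees(doc, claim):
    return "false" not in doc.lower().split()

claim = "the vaccine causes harm"
docs = [
    "Football results tonight",
    "Officials discuss the vaccine rollout",
    "It is false that the vaccine causes harm",
    "It is true the vaccine causes harm",
]
for d in docs:
    print(d, "->", cascade_stance(d, claim, is_related, takes_stance, agrees))
```

Splitting the task this way lets each step use its own model and its own misclassification costs, which matters for the rare disagree class.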
Wrap-up: found data
Archival/collection
▪ Easy (assuming gatekeeper's goodwill), even over long time periods (TweetsKB: 10 years)
▪ Public APIs, screen-scraping, crawling
Analysis
▪ Heterogeneity and scale of data (e.g. tweets, query logs)
▪ Feature extraction (stances, topics, emotions, etc) across the entire corpus is challenging
▪ Specific research questions usually require dedicated models (no one-size-fits-all approach)
Sharing
▪ Strict constraints (legal, ethical, licensing)
▪ Scalable sharing of sensitive data is still an unsolved problem
Designed behavioral web data to the rescue
▪ Goal: obtain sharable and easy-to-interpret behavioral web data through experimental lab studies & quasi-experiments
▪ Typically involves:
− Artificial settings (e.g. labs)
− Simulation of real-world online scenarios (e.g. web search)
− Usually less sensitive data
− Full consent of participants to data collection & sharing intentions
− Short time intervals
− Small-scale data (due to costly process)
Case: web search behavior (SAL = „Search As Learning“)
Research challenges at the intersection of AI/ML, HCI & cognitive psychology
▪ Detecting coherent search missions?
▪ Detecting learning throughout search? Detecting “informational” search missions (as opposed to “transactional” or “navigational” missions)
▪ How competent is the user? Predict/understand the knowledge state of users based on in-session behavior/interactions
▪ How well does a user achieve their learning goal/information need? Predict knowledge gain throughout the search session
Hoppe, A., Holtz, P., Kammerer, Y., Yu, R., Dietze, S., Ewerth, R., Current Challenges for Studying Search as Learning Processes, 7th Workshop on
Learning & Education with Web Data (LILE2018), in conjunction with ACM Web Science 2018 (WebSci18), Amsterdam, NL, 27 May, 2018.
Data collection for understanding knowledge gain/state of users
Gadiraju, U., Yu, R., Dietze, S., Holtz, P., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM CHIIR 2018.
Data collection - summary
▪ Crowdsourced collection of search session data
▪ 10 search topics (e.g. “Altitude sickness”,
“Tornados”), incl. pre- and post-tests to assess
user knowledge
▪ Approx. 1000 distinct crowd workers & 100
sessions per topic
▪ Tracking of user behavior through 76 features
in 5 categories (session, query, SERP – search
engine result page, browsing, mouse traces)
Understanding knowledge gain/state of users during web search
Some results
▪ 70% of users exhibited a knowledge gain (KG)
▪ Negative relationship between KG of users and topic popularity (avg. accuracy of workers in knowledge tests) (R = -0.87)
▪ Amount of time users actively spent on web pages explains 7% of the variance in their KG
▪ Query complexity explains 25% of the variance in the KG of users
▪ Topic-dependent behavior: search behavior correlates more strongly with search topic than with KG/KS (knowledge state)
Predicting knowledge gain/state during web search
▪ Same session data as Gadiraju et al., 2018
▪ Stratification of users into classes: user knowledge state (KS) and knowledge gain (KG) are each binned into {low, moderate, high}, with low < mean − 0.5 SD, moderate within mean ± 0.5 SD, and high > mean + 0.5 SD
▪ Supervised multiclass classification (Naive Bayes, logistic regression, SVM, random forest, multilayer perceptron)
▪ KG prediction performance results (after 10-fold cross-validation)
▪ Considers in-session features (behavioural traces) only
Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM SIGIR 2018.
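The stratification into {low, moderate, high} around mean ± 0.5 SD can be sketched as follows. This is a minimal illustration with invented scores. Whether the sample or the population standard deviation was used in the study is not stated on the slide, so the use of the sample SD here is an assumption.

```python
from statistics import mean, stdev

def stratify(scores, width=0.5):
    """Label each score low / moderate / high relative to mean ± width * SD.

    Uses the sample standard deviation (assumption; needs >= 2 scores).
    """
    m, sd = mean(scores), stdev(scores)
    lower, upper = m - width * sd, m + width * sd
    labels = []
    for s in scores:
        if s < lower:
            labels.append("low")
        elif s > upper:
            labels.append("high")
        else:
            labels.append("moderate")
    return labels

# Toy knowledge-gain scores (hypothetical).
scores = [0, 2, 4, 6, 8]
print(list(zip(scores, stratify(scores))))
```

The resulting labels serve as the class targets for the multiclass classifiers listed above.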
Predicting knowledge gain/state during SAL: Features
[Figure: behavioral features]
[Figure: feature importance (knowledge gain prediction task)]
[Figure: feature importance (knowledge state prediction task)]
Gaze data as additional source of behavioral data in SAL
Davari, M., Yu, R., Dietze, S., Understanding the Influence of Topic Familiarity on Search Behavior in Digital Libraries, EARS 2019 – International Workshop on ExplainAble Recommendation and Search, @ SIGIR2019, 2019.
Otto, C., Yu, R., Pardi, G., von Hoyer, J., Rokicki, M., Hoppe, A., Holtz, P., Kammerer, Y., Dietze, S., Ewerth, R., Predicting Knowledge Gain during Web Search based on Multimedia Resource Consumption, 22nd International Conference on Artificial Intelligence in Education (AIED2021), 2021
▪ Eye gaze data (word-, sentence-, or HTML structure-level) as additional source of behavioral data
▪ Various studies in SAL context and beyond to understand topic familiarity, knowledge & competence or comprehension issues
▪ Usually small study sizes (e.g. 25 < N < 150)
▪ Costly but highly informative features
Facilitating SAL research through public research data
https://data.uni-hannover.de/dataset/sal-dataset
Otto, C., Rokicki, M., Pardi, G., Gritz, W., Hienert, D., Yu, R., Hoyer, J., Hoppe, A., Dietze, S., Holtz, P., Kammerer, Y., Ewerth, R., SaL-Lightning Dataset: Search and Eye Gaze Behavior, Resource Interactions and Knowledge Gain during Web Search, ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR2022).
Case: crowd worker behavior in microtask crowdsourcing
Gadiraju, U., Kawase, R., Dietze, S., Demartini, G., Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online Surveys. ACM CHI2015
Gadiraju, U., Demartini, G., Kawase, R., Dietze, S., Crowd Anatomy Beyond the Good and Bad: Behavioral Traces for Crowd Worker Modeling and Pre-selection, Computer Supported Cooperative Work 28(5): 815-841 (2019), Springer, 2019.
[Figure labels: “Fast Deceiver”, “Competent Worker”]
▪ Context: online crowdsourcing tasks are widely used to collect data
▪ Research question: can we classify different worker types (and detect competent workers) from behavioral traces alone (mouse movements, scrolling, keystrokes etc)?
▪ Various studies in experimental conditions capturing a wide range of features in various tasks
▪ Low-level behavioural features are highly informative when predicting worker competence and output quality
Wrap-up: found vs designed behavioral data
Collection
▪ Found data: possible as long as gatekeepers allow crawling / scraping
▪ Designed data: costly experimental data collection
Analysis
▪ Found data: large & heterogeneous data; long time intervals; no “one-size-fits-all” methods
▪ Designed data: homogeneous, small-scale data; short time intervals; limited use cases
Sharing
▪ Found data: sensitive information; ethical, legal, licensing constraints
▪ Designed data: full consent of participants; little sensitive information due to artificial tasks
Key take-aways
▪ Behavioral web data: crucial ingredient for a wide range of research across various disciplines
▪ Found data: crucial to archive to ensure long-term access; sharing is hard due to sensitive information.
▪ Designed data: collection is costly; limited scale and scope of data.
▪ Access to behavioral web data remains a challenge => ongoing & future work @ KTS/GESIS on
− infrastructures for collecting experimental data (e.g. in web search)
− infrastructures for data access (e.g. for tweet archives)
− non-sensitive data offers to enable reuse of sensitive found data (e.g. TweetsKB)
@stefandietze
https://stefandietze.net
http://gesis.org/en/kts

More Related Content

Similar to Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside

Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods: Extracting So...
Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods:  Extracting So...Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods:  Extracting So...
Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods: Extracting So...
Shalin Hai-Jew
 
Introduction MA Data, Culture and Society | University of Westminster, UK
Introduction MA Data, Culture and Society | University of Westminster, UKIntroduction MA Data, Culture and Society | University of Westminster, UK
Introduction MA Data, Culture and Society | University of Westminster, UK
slejay
 
Me and My Big Data Project
Me and My Big Data Project Me and My Big Data Project
Me and My Big Data Project
DIPRC2019
 
Knowledge Engineering, Electronic Government and the applications to Scientom...
Knowledge Engineering, Electronic Government and the applications to Scientom...Knowledge Engineering, Electronic Government and the applications to Scientom...
Knowledge Engineering, Electronic Government and the applications to Scientom...
Roberto C. S. Pacheco
 
Predicting News Popularity by Mining Online Discussions
Predicting News Popularity by Mining Online DiscussionsPredicting News Popularity by Mining Online Discussions
Predicting News Popularity by Mining Online Discussions
Symeon Papadopoulos
 
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Stefan Dietze
 
Big Data and Research Ethics
Big Data and Research EthicsBig Data and Research Ethics
Big Data and Research Ethics
Jan Schmidt
 
Univ. of AZ Global Racing Symposium 2015 - Digital Strategies
Univ. of AZ Global Racing Symposium 2015 - Digital StrategiesUniv. of AZ Global Racing Symposium 2015 - Digital Strategies
Univ. of AZ Global Racing Symposium 2015 - Digital Strategies
smfrisby
 
Big data and development
Big data and developmentBig data and development
Big data and development
Simone Sala
 
Big data presentation for University of Reykjavik, Iceland, March 22
Big data presentation for University of Reykjavik, Iceland, March 22 Big data presentation for University of Reykjavik, Iceland, March 22
Big data presentation for University of Reykjavik, Iceland, March 22
Thorhildur Jetzek, Ph.D.
 
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013Digital Methods Initiative
 
Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...
Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...
Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...
John Makridis
 
Privacy Management and the Social Web
Privacy Management and the Social Web Privacy Management and the Social Web
Privacy Management and the Social Web
Jan Schmidt
 
New Methodologies for Capturing and Working with Publicly Available Twitter Data
New Methodologies for Capturing and Working with Publicly Available Twitter DataNew Methodologies for Capturing and Working with Publicly Available Twitter Data
New Methodologies for Capturing and Working with Publicly Available Twitter Data
Axel Bruns
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Hoang Nguyen
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Tony Nguyen
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Young Alista
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
Harry Potter
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
James Wong
 

Similar to Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside (20)

Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods: Extracting So...
Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods:  Extracting So...Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods:  Extracting So...
Hashtag Conversations,Eventgraphs, and User Ego Neighborhoods: Extracting So...
 
Introduction MA Data, Culture and Society | University of Westminster, UK
Introduction MA Data, Culture and Society | University of Westminster, UKIntroduction MA Data, Culture and Society | University of Westminster, UK
Introduction MA Data, Culture and Society | University of Westminster, UK
 
Me and My Big Data Project
Me and My Big Data Project Me and My Big Data Project
Me and My Big Data Project
 
Knowledge Engineering, Electronic Government and the applications to Scientom...
Knowledge Engineering, Electronic Government and the applications to Scientom...Knowledge Engineering, Electronic Government and the applications to Scientom...
Knowledge Engineering, Electronic Government and the applications to Scientom...
 
Predicting News Popularity by Mining Online Discussions
Predicting News Popularity by Mining Online DiscussionsPredicting News Popularity by Mining Online Discussions
Predicting News Popularity by Mining Online Discussions
 
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
Human-in-the-loop: the Web as Foundation for interdisciplinary Data Science M...
 
Big Data and Research Ethics
Big Data and Research EthicsBig Data and Research Ethics
Big Data and Research Ethics
 
Univ. of AZ Global Racing Symposium 2015 - Digital Strategies
Univ. of AZ Global Racing Symposium 2015 - Digital StrategiesUniv. of AZ Global Racing Symposium 2015 - Digital Strategies
Univ. of AZ Global Racing Symposium 2015 - Digital Strategies
 
Big data and development
Big data and developmentBig data and development
Big data and development
 
Big data presentation for University of Reykjavik, Iceland, March 22
Big data presentation for University of Reykjavik, Iceland, March 22 Big data presentation for University of Reykjavik, Iceland, March 22
Big data presentation for University of Reykjavik, Iceland, March 22
 
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
Cross-Platform Profiling tutorial at the Digital Methods Summer School 2013
 
Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...
Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...
Big Data Analytics and Knowledge Discovery through Location-Based Social Netw...
 
Privacy Management and the Social Web
Privacy Management and the Social Web Privacy Management and the Social Web
Privacy Management and the Social Web
 
New Methodologies for Capturing and Working with Publicly Available Twitter Data
New Methodologies for Capturing and Working with Publicly Available Twitter DataNew Methodologies for Capturing and Working with Publicly Available Twitter Data
New Methodologies for Capturing and Working with Publicly Available Twitter Data
 
CI_for_NA
CI_for_NACI_for_NA
CI_for_NA
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 

More from Stefan Dietze

Understanding Scientific and Societal Adoption and Impact of Science Through ...
Understanding Scientific and Societal Adoption and Impact of Science Through ...Understanding Scientific and Societal Adoption and Impact of Science Through ...
Understanding Scientific and Societal Adoption and Impact of Science Through ...
Stefan Dietze
 
NEWORDER Project - Science in the online knowledge order
NEWORDER Project - Science in the online knowledge orderNEWORDER Project - Science in the online knowledge order
NEWORDER Project - Science in the online knowledge order
Stefan Dietze
 
An interdisciplinary journey with the SAL spaceship – results and challenges ...
An interdisciplinary journey with the SAL spaceship – results and challenges ...An interdisciplinary journey with the SAL spaceship – results and challenges ...
An interdisciplinary journey with the SAL spaceship – results and challenges ...
Stefan Dietze
 
Research Knowledge Graphs at NFDI4DS & GESIS
Research Knowledge Graphs at NFDI4DS & GESISResearch Knowledge Graphs at NFDI4DS & GESIS
Research Knowledge Graphs at NFDI4DS & GESIS
Stefan Dietze
 
Research Knowledge Graphs at GESIS & NFDI4DataScience
Research Knowledge Graphs at GESIS & NFDI4DataScienceResearch Knowledge Graphs at GESIS & NFDI4DataScience
Research Knowledge Graphs at GESIS & NFDI4DataScience
Stefan Dietze
 
Human-in-the-Loop: das Web als Grundlage interdisziplinärer Data Science Meth...
Human-in-the-Loop: das Web als Grundlage interdisziplinärer Data Science Meth...Human-in-the-Loop: das Web als Grundlage interdisziplinärer Data Science Meth...
Human-in-the-Loop: das Web als Grundlage interdisziplinärer Data Science Meth...
Stefan Dietze
 
Towards research data knowledge graphs
Towards research data knowledge graphsTowards research data knowledge graphs
Towards research data knowledge graphs
Stefan Dietze
 
Beyond research data infrastructures: exploiting artificial & crowd intellige...
Beyond research data infrastructures: exploiting artificial & crowd intellige...Beyond research data infrastructures: exploiting artificial & crowd intellige...
Beyond research data infrastructures: exploiting artificial & crowd intellige...
Stefan Dietze
 
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
From Web Data to Knowledge: on the Complementarity of Human and Artificial In...
Stefan Dietze
 
Using AI to understand everyday learning on the Web
Using AI to understand everyday learning on the WebUsing AI to understand everyday learning on the Web
Using AI to understand everyday learning on the Web
Stefan Dietze
 
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside

  • 1. Collecting and Temporal Analysis of Behavioral Web Data - Tales from the Inside TempWeb2024, 13 May 2024 Stefan Dietze GESIS, HHU & HeiCAD Düsseldorf
  • 2. What is behavioral web data? Source: Domo via PCMag
  • 3. What is behavioral web data? ▪ Social web activity streams (posts, shares, likes, follows etc) ▪ Web search behaviour & SERP (Search Engine Result Pages) interactions ▪ Browsing and navigation behaviour ▪ Low-level behavioral traces (scrolling, mouse movements, gaze behavior etc) ▪ Hard to separate from actual Web content/pages ▪ But: closer to users & their personal (potentially sensitive) information
  • 4. Why is it important? ▪ Reflects attitudes, leanings, cognitive states, biases ▪ Without understanding behavior, we cannot understand content / data it produces ▪ Majority of algorithms and models rely on behavioral data (e.g. clickthrough data for ranking algorithms)… ▪ …or are substantially impacted by user behavior (e.g. LLMs trained on user-generated content that in turn is driven by user interactions) ▪ Central to various research fields in CS concerned with information behavior: interactive information retrieval, HCI, user modeling, Web mining, etc
  • 5. Why is it important? ▪ Spawned entirely new research areas like Computational Social Science (CSS)
  • 6. Overview ▪ Challenges of behavioral web data ▪ Case studies (collecting, sharing, analysis: data & methods) o "Found" behavioral web data o "Designed" behavioral web data ▪ Take-aways & outlook
  • 7. Challenges: dependencies on 3rd party gatekeepers Behavioral data is usually tied to specific platforms, not distributed as the WWW
  • 8. Challenges: volatility & decay of data • Data is not persistent • Example: deletion ratio of tweets between 25-29 % • Differs between different samples
  • 9. Challenges: volatility & decay of data
  • 10. Challenges: behavioral data is sensitive Example: AOL release of 20 M search queries (2006)
  • 11. Challenges: legal restrictions and ethical concerns ▪ Behavioral web data tends to involve sensitive information ▪ Ethical concerns, e.g., when information is taken out of context ▪ Various national and international laws (GDPR etc) ▪ Licensing / legal aspects: Twitter terms of service, copyright, etc. ▪ At the same time: right to archive / research wired into various national legislations ▪ Different constraints for (a) archiving and (b) sharing / using data as well as for different uses & users (e.g. archival institutions) ▪ Individual risk assessment per use case: What (kind of data)?, For what purpose? By whom?
  • 12. Overview ▪ Challenges of behavioral web data ▪ Case studies (collecting, sharing, analysis: data and methods) o "Found" behavioral web data o "Designed" behavioral web data ▪ Take-aways & outlook
  • 13. Found & designed web data for investigating (mis)information behavior. Range of research concerned with IR & CSS: ▪ Insights, e.g.: − Understanding information interaction (e.g. during search) − Spreading of claims and misinformation − Effect of biased news/claims on public opinion ▪ Computational Methods, e.g.: − Crawling, harvesting, scraping of data − Information retrieval & ranking − Extraction of structured knowledge (entities, sentiments, stances, claims, etc) − Classification of search/navigation behavior or users http://gesis.org/en/kts
  • 14. Found behavioral web data ▪ Data that can be harvested via open APIs or scraped from the public web over long time periods and that captures real-world online interactions "found" in the wild ▪ Examples: social web posts/interactions, Twitter/X data (specifically before the API shutdown) ▪ Tends to include data that has been shared voluntarily by online users, e.g. Twitter users ▪ But: users usually did not provide explicit consent for secondary use of their data
  • 15. Case study: Twitter/X Motivation Archival perspective: ▪ Ensure long-term archival of volatile information from Twitter ▪ Independence from third-party data access / APIs Research perspective ▪ Training and evaluating machine learning models (e.g., NER, classification) ▪ Large-scale analyses (e.g., language use, trends) ▪ Facilitate interdisciplinary research on societal online discourse (e.g. political science, communication science, psychology, sociology) → Goal: capture a representative sample of all Twitter data
  • 16. Why real-time collection & preservation of Twitter/X data? Data decay: ▪ Approx. 28% of tweets deleted over time ▪ Power law distribution: the vast majority of deletions is due to a small number of users ▪ Prevalent biases in deleted/non-deleted data: anti-science, conservative and hard-line views more frequent in deleted tweets
  • 17. Why real-time collection & preservation of Twitter/X data? Model decay due to evolving language & vocabulary ▪ Models & LLMs trained on large volumes of text ▪ Yet: strong vocabulary shift, over-/underrepresentation of topics/vocabulary in particular time periods (e.g. Twitter COVID-19 discourse 2020 vs 2019) ▪ LLMs for online discourse analysis require frequent training and updates (and continuous access to data) Source: Hombaiah et al., "Dynamic Language Models for continuously evolving Content", SIGKDD2021
  • 18. Redundant crawls of 1% Twitter stream via Firehose API ▪ 14 billion tweets collected between 04/2013 – 05/2023 ▪ Largest continuous tweet archive for research purposes ▪ Legal, ethical and licensing constraints (Twitter terms of service) ▪ Data sharing via: o Sensitive data access: facilitating on-prem research on data (e.g. online/offline secure data centers) or contract-based sharing of sensitive data o Public, non-sensitive data offers: creating non-sensitive derivatives from raw data to facilitate research
  • 19. TweetsKB – a non-sensitive large-scale archive of societal discourse ▪ Subset of 3 billion prefiltered tweets (English, spam detection through pretrained classifier) ▪ Sharing of tweet metadata (time stamps, retweet counts etc), hash tags, user mentions and dedicated features that capture tweet semantics (no actual user IDs and full texts) ▪ Features include [CIKM2020, CIKM2022]: o Disambiguated mentions of entities, linked to Wikipedia/DBpedia ("president"/"potus"/"trump" => dbp:DonaldTrump) o Sentiment scores (positive/negative emotions) o Geotags via pretrained DeepGeo model o Science references/claims [CIKM2022] https://data.gesis.org/tweetskb
        Feature   | Total         | Unique      | % with >= 1 feature
        Hashtags  | 1,161,839,471 | 68,832,205  | 0.19
        Mentions  | 1,840,456,543 | 149,277,474 | 0.38
        Entities  | 2,563,433,997 | 2,265,201   | 0.56
        Sentiment | 1,265,974,641 | -           | 0.5
    Dimitrov, D., Fafalios, P., Yu, R., Zhu, X., Zloch, M., Dietze, S., TweetsCOV19 – A KB of Semantically Annotated Tweets about the COVID-19 Pandemic, CIKM2020 Hafid, S., Schellhammer, S., Bringay, S., Todorov, K., Dietze, S., SciTweets - A Dataset and Annotation Framework for Detecting Scientific Online Discourse, CIKM2022
  • 20. TweetsKB – knowledge graph schema & data access https://data.gesis.org/tweetskb Data access via: ▪ SPARQL endpoint/REST API for demos ▪ Download of data dumps (Zenodo, SDN Datorium) ▪ So far approx. 30 K downloads Dimitrov, D., Fafalios, P., Yu, R., Zhu, X., Zloch, M., Dietze, S., TweetsCOV19 – A KB of Semantically Annotated Tweets about the COVID-19 Pandemic, CIKM2020
  • 21. TweetsKB as social science research corpus: investigating vaccine hesitancy in DACH countries. Twitter discourse on "Impfbereitschaft" / "vaccination hesitancy" (figure annotation: Germany suspends vaccinations with AstraZeneca) https://dd4p.gesis.org/ Boland, K. et al., Data for policy-making in times of crisis - a computational analysis of German online discourses about COVID-19 vaccinations, JMIR, under review
  • 22. Case: Telegram ▪ Telegram channels: public, only admin can post (as opposed to private groups) ▪ Decentralised: no registry of channels available ▪ Continuous data collection of currently 400 K channels through snowball sampling (300 seed channels) ▪ Full message history collected for > 10 K channels, approx. 100 M messages so far ▪ Telegram cross-channel message passing dataset extracted to support information spreading research, i.e., mis- and disinformation, hate speech etc
  • 23. Understanding claims & misinformation on the Web: ClaimsKG Motivation ▪ Claims spread across various (unstructured) fact-checking sites ▪ Claims and truth ratings evolve over time ▪ Finding claims is hard: e.g. claims about / made by US Republican politicians across the Web? Approach ▪ Continuous harvesting of claims & metadata from fact-checking sites (e.g. snopes.com, Politifact.com etc); currently approx. 75,000 claims since 2019 ▪ Feature extraction & linking: o Mentioned entities o Joint topic classification o Normalisation of ratings (true, false, mixture, other); coreference resolution of claims o Exposing data through established vocabulary and W3C standards (e.g. SPARQL endpoint) https://data.gesis.org/claimskg/ A. Tchechmedjiev, P. Fafalios, K. Boland, S. Dietze, B. Zapilko, K. Todorov, ClaimsKG – A Live Knowledge Graph of fact-checked Claims, ISWC2019
  • 24. Evolution of claims: frequency & topics https://data.gesis.org/claimskg/ S. Gangopadhay et al., Investigating Characteristics, Biases and Evolution of Fact-Checked Claims on the Web, ACM Hypertext 2024 (under review)
  • 25. Evolution of claims: topic biases of fact-check sources https://data.gesis.org/claimskg/ S. Gangopadhay et al., Investigating Characteristics, Biases and Evolution of Fact-Checked Claims on the Web, ACM Hypertext 2024 (under review)
  • 26. Stances towards claims / fake news in social media Motivation ▪ Problem: detecting stance of documents (e.g. social media posts) towards a given claim (unbalanced class distribution) ▪ Motivation: stance of documents (in particular disagreement) useful (a) as signal for truthfulness (fake news detection) and (b) for document or source classification (e.g. users) Approach ▪ Cascading binary classifiers: addressing individual issues (e.g. misclassification costs) per step ▪ Features, e.g. textual similarity (Word2Vec etc), sentiments, LIWC, etc. ▪ Best-performing models: 1) SVM with class-wise penalty, 2) CNN, 3) SVM with class-wise penalty ▪ Experiments on FNC-1 dataset (and FNC baselines) Results ▪ Minor overall performance improvement ▪ Improvement on disagree class by 27% (but still far from robust) A., Fafalios, P., Ekbal, A., Zhu, X., Dietze, S., Exploiting stance hierarchies for cost-sensitive stance detection of Web documents, J Intell. Inf. Syst. 58(1), 1-19 (2022)
  • 27. Wrap-up: found data Archival/collection ▪ Easy (assuming gatekeeper's goodwill), even over long time periods (TweetsKB: 10 years) ▪ Public APIs, screen-scraping, crawling Analysis ▪ Heterogeneity and scale of data (example Tweets, query logs) ▪ Feature extraction (stances, topics, emotions, etc) across entire corpus challenging ▪ Specific research questions usually require dedicated models (no one-size-fits-all approach) Sharing ▪ Strict constraints (legal, ethical, licensing) ▪ Scalable sharing of sensitive data still an unsolved problem
  • 28. Designed behavioral web data to the rescue ▪ Goal: obtain sharable and easy to interpret behavioral web data through experimental lab studies & quasi-experiments ▪ Typically involves: − Artificial settings (e.g. labs) − Simulation of real-world online scenarios (e.g. web search) − Usually less sensitive data − Full consent of participants about data collection & sharing intentions − Short time intervals − Small-scale data (due to costly process)
  • 29. Case: web search behavior (SAL = "Search As Learning") Research challenges at the intersection of AI/ML, HCI & cognitive psychology ▪ Detecting coherent search missions? ▪ Detecting learning throughout search? Detecting "informational" search missions (as opposed to "transactional" or "navigational" missions) ▪ How competent is the user? – Predict/understand knowledge state of users based on in-session behavior/interactions ▪ How well does a user achieve their learning goal/information need? – Predict knowledge gain throughout search session Hoppe, A., Holtz, P., Kammerer, Y., Yu, R., Dietze, S., Ewerth, R., Current Challenges for Studying Search as Learning Processes, 7th Workshop on Learning & Education with Web Data (LILE2018), in conjunction with ACM Web Science 2018 (WebSci18), Amsterdam, NL, 27 May, 2018.
  • 30. Data collection for understanding knowledge gain/state of users. Data collection - summary ▪ Crowdsourced collection of search session data ▪ 10 search topics (e.g. "Altitude sickness", "Tornados"), incl. pre- and post-tests to assess user knowledge ▪ Approx. 1000 distinct crowd workers & 100 sessions per topic ▪ Tracking of user behavior through 76 features in 5 categories (session, query, SERP – search engine result page, browsing, mouse traces) Gadiraju, U., Yu, R., Dietze, S., Holtz, P., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM CHIIR 2018.
  • 31. Understanding knowledge gain/state of users during web search. Some results ▪ 70% of users exhibited a knowledge gain (KG) ▪ Negative relationship between KG of users and topic popularity (avg. accuracy of workers in knowledge tests) (R = -0.87) ▪ Amount of time users actively spent on web pages explains 7% of the variance in their KG ▪ Query complexity explains 25% of the variance in the KG of users ▪ Topic-dependent behavior: search behavior correlates more strongly with search topic than with KG/KS Gadiraju, U., Yu, R., Dietze, S., Holtz, P., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM CHIIR 2018.
  • 32. Predicting knowledge gain/state during web search ▪ Same session data as Gadiraju et al., 2018 ▪ Stratification of users by knowledge state (KS) and knowledge gain (KG) into the classes {low, moderate, high}, using low < (mean ± 0.5 SD) < high ▪ Supervised multiclass classification (Naive Bayes, logistic regression, SVM, random forest, multilayer perceptron) ▪ KG prediction performance results (after 10-fold cross-validation) ▪ Considers in-session features (behavioural traces) only Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM SIGIR 2018.
  • 33. Predicting knowledge gain/state during SAL: Features (behavioral features) Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM SIGIR 2018.
  • 34. Predicting knowledge gain/state during web search ▪ Feature importance (knowledge gain prediction task) Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM SIGIR 2018.
  • 35. Predicting knowledge gain/state during web search ▪ Feature importance (knowledge state prediction task) Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM SIGIR 2018.
  • 36. Gaze data as additional source of behavioral data in SAL ▪ Eye gaze data (word-, sentence-, or HTML structure-level) as additional source of behavioral data ▪ Various studies in SAL context and beyond to understand topic familiarity, knowledge & competence or comprehension issues ▪ Usually small study sizes (e.g. 25 < N < 150) ▪ Costly but highly informative features Davari, M., Yu, R., Dietze, S., Understanding the Influence of Topic Familiarity on Search Behavior in Digital Libraries, EARS 2019 – International Workshop on ExplainAble Recommendation and Search, @ SIGIR2019, 2019. Otto, C., Yu, R., Pardi, G., von Hoyer, J., Rokicki, M., Hoppe, A., Holtz, P., Kammerer, Y., Dietze, S., Ewerth, R., Predicting Knowledge Gain during Web Search based on Multimedia Resource Consumption, 22nd International Conference on Artificial Intelligence in Education (AIED2021), 2021
  • 37. Facilitating SAL research through public research data https://data.uni-hannover.de/dataset/sal-dataset Otto, C., Rokicki, M., Pardi, G., Gritz, W., Hienert, D., Yu, R., Hoyer, J., Hoppe, A., Dietze, S., Holtz, P., Kammerer, Y., Ewerth, R., SaL-Lightning Dataset: Search and Eye Gaze Behavior, Resource Interactions and Knowledge Gain during Web Search, ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR2022).
  • 38. Case: crowd worker behavior in microtask crowdsourcing ▪ Context: online crowdsourcing tasks are widely used to collect data ▪ Research question: can we classify different worker types (and detect competent workers) from behavioral traces alone (mouse movements, scrolling, keystrokes etc)? ▪ Various studies under experimental conditions, capturing a wide range of features across various tasks ▪ Low-level behavioral features are highly informative when predicting worker competence and output quality ▪ Example worker types: "Fast Deceiver", "Competent Worker" Gadiraju, U., Kawase, R., Dietze, S., Demartini, G., Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online Surveys. ACM CHI2015. Gadiraju, U., Demartini, G., Kawase, R., Dietze, S., Crowd Anatomy Beyond the Good and Bad: Behavioral Traces for Crowd Worker Modeling and Pre-selection, Computer Supported Cooperative Work 28(5): 815-841, Springer, 2019.
  • 39. Wrap-up: found vs designed behavioral data ▪ Collection: FOUND DATA: possible only as long as gatekeepers allow crawling / scraping; DESIGNED DATA: costly experimental data collection ▪ Analysis: FOUND DATA: large & heterogeneous data; long time intervals; no "one-size-fits-all" methods; DESIGNED DATA: homogeneous, small-scale data; short time intervals; limited use cases ▪ Sharing: FOUND DATA: sensitive information; ethical, legal, licensing constraints; DESIGNED DATA: full consent of participants; little sensitive information due to artificial tasks
  • 40. Key take-aways ▪ Behavioral Web Data: crucial ingredient for a wide range of research across various disciplines ▪ Found Data: crucial to archive to ensure long-term access; sharing is hard due to sensitive information ▪ Designed Data: collection is costly; limited scale and scope of data ▪ Access to behavioral web data remains a challenge => ongoing & future work @ KTS/GESIS on − infrastructures for collecting experimental data (e.g. in web search) − infrastructures for data access (e.g. for tweet archives) − non-sensitive data offers to enable reuse of sensitive found data (e.g. TweetsKB)
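Slide 38 refers to low-level behavioral traces (mouse movements, scrolling, keystrokes) as input for worker modeling. As a rough illustration of what turning such a trace into features might look like, here is a minimal Python sketch. The event format, the function name `extract_features`, and the chosen features are assumptions made for this example; they are not taken from the cited studies, which capture a much wider range of features.

```python
from math import hypot
from statistics import mean

# Hypothetical event format (an assumption for this sketch):
#   (timestamp_seconds, event_type, x, y)
# x and y are only meaningful for "mousemove" events.


def extract_features(events):
    """Aggregate a raw interaction log into a few per-session features."""
    events = sorted(events, key=lambda e: e[0])
    moves = [(x, y) for _, kind, x, y in events if kind == "mousemove"]
    key_times = [t for t, kind, _, _ in events if kind == "keydown"]

    # Total cursor path length: sum of distances between consecutive positions.
    mouse_distance = sum(
        hypot(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in zip(moves, moves[1:])
    )
    # Gaps between consecutive keystrokes (empty if fewer than two keystrokes).
    gaps = [b - a for a, b in zip(key_times, key_times[1:])]

    return {
        "duration_s": events[-1][0] - events[0][0] if events else 0.0,
        "mouse_distance_px": mouse_distance,
        "scroll_events": sum(1 for e in events if e[1] == "scroll"),
        "keystrokes": len(key_times),
        "mean_key_gap_s": mean(gaps) if gaps else None,
    }


# Toy session, invented for illustration only.
session = [
    (0.0, "mousemove", 0, 0),
    (0.5, "mousemove", 3, 4),
    (1.0, "scroll", 0, 0),
    (1.5, "keydown", 0, 0),
    (2.0, "keydown", 0, 0),
    (3.0, "keydown", 0, 0),
    (4.0, "mousemove", 3, 10),
]
features = extract_features(session)
```

A feature dictionary of this kind, computed per worker and task, could then be passed to any standard classifier; which features and models work best is an empirical question addressed in the cited papers, not by this sketch.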