Collecting and Temporal Analysis of Behavioral Web
Data - Tales from the Inside
TempWeb2024, 13 May 2024
Stefan Dietze
GESIS, HHU & HeiCAD Düsseldorf
What is behavioral web data?
(Figure source: Domo via PCMag)
▪ Social web activity streams (posts, shares, likes, follows, etc.)
▪ Web search behavior & SERP (search engine result page) interactions
▪ Browsing and navigation behavior
▪ Low-level behavioral traces (scrolling, mouse movements, gaze behavior, etc.)
▪ Hard to separate from actual Web content/pages
▪ But: closer to users & their personal (potentially sensitive) information
Why is it important?
▪ Reflects attitudes, leanings, cognitive states, biases
▪ Without understanding behavior, we cannot understand the content / data it produces
▪ Majority of algorithms and models rely on behavioral data (e.g. clickthrough data for
ranking algorithms)…
▪ …or are substantially impacted by user behavior (e.g. LLMs trained on user-generated
content that in turn is driven by user interactions)
▪ Central to various research fields in CS concerned with information behavior:
interactive information retrieval, HCI, user modeling, Web mining, etc
▪ Spawned entirely new research areas like Computational Social Science (CSS)
Overview
▪ Challenges of behavioral web data
▪ Case studies (collecting, sharing, analysis: data & methods)
o “Found” behavioral web data
o “Designed” behavioral web data
▪ Take-aways & outlook
Challenges: dependencies on 3rd party gatekeepers
Behavioral data is usually tied to specific platforms rather than distributed like the WWW
Challenges: volatility & decay of data
• Data is not persistent
• Example: deletion ratio of tweets between 25% and 29%
• The ratio differs between samples
Challenges: behavioral data is sensitive
Example: AOL release of 20 M search queries (2006)
Challenges: legal restrictions and ethical concerns
▪ Behavioral web data tends to involve sensitive information
▪ Ethical concerns, e.g., when information is taken out of context
▪ Various national and international laws (GDPR etc)
▪ Licensing / legal aspects: Twitter terms of service, copyright, etc.
▪ At the same time: right to archive / research anchored in various national laws
▪ Different constraints for (a) archiving and (b) sharing / using data, as well as for different uses & users (e.g. archival institutions)
▪ Individual risk assessment per use case: What (kind of data)? For what purpose? By whom?
Found & designed web data for investigating (mis)information behavior
Range of research concerned with IR & CSS:
▪ Insights, e.g.:
− Understanding information interaction (e.g. during search)
− Spreading of claims and misinformation
− Effect of biased news/claims on public opinion
▪ Computational methods, e.g.:
− Crawling, harvesting, scraping of data
− Information retrieval & ranking
− Extraction of structured knowledge (entities, sentiments, stances, claims, etc.)
− Classification of search/navigation behavior or users
http://gesis.org/en/kts
Found behavioral web data
▪ Data that can be harvested via open APIs or scraped from the public web over long time periods, capturing real-world online interactions “found” in the wild
▪ Examples: social web posts/interactions, Twitter/X data (specifically before the API shutdown)
▪ Tends to include data that has been shared voluntarily by online users, e.g. Twitter users
▪ But: users usually did not provide explicit consent for secondary use of their data
Case study: Twitter/X
Motivation
Archival perspective:
▪ Ensure long-term archival of volatile information from Twitter
▪ Independence from third-party data access / APIs
Research perspective:
▪ Training and evaluating machine learning models (e.g., NER, classification)
▪ Large-scale analyses (e.g., language use, trends)
▪ Facilitate interdisciplinary research on societal online discourse
(e.g. political science, communication science, psychology, sociology)
→ Goal: capture a representative sample of all Twitter data
Why real-time collection & preservation of Twitter/X data?
Data decay
▪ Approx. 28% of tweets are deleted over time
▪ Power-law distribution: the vast majority of deleted tweets are deleted by a small number of users
▪ Prevalent biases in deleted vs. non-deleted data: anti-science, conservative and hard-line views are more frequent in deleted tweets
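The two decay figures above (overall deletion ratio, concentration of deletions among few users) could be measured on an archive roughly as follows. This is a minimal sketch, assuming a hypothetical list of `(user_id, deleted)` pairs obtained by re-checking archived tweets for availability; the function name and the `top_share` parameter are illustrative and not taken from the talk.

```python
from collections import Counter

def deletion_stats(records, top_share=0.1):
    """records: (user_id, deleted) pairs, one per archived tweet (hypothetical format).

    Returns (deletion ratio, share of deletions caused by the top `top_share` of users).
    """
    records = list(records)
    deleted_by_user = Counter(user for user, deleted in records if deleted)
    n_deleted = sum(deleted_by_user.values())
    ratio = n_deleted / len(records) if records else 0.0

    users = {user for user, _ in records}
    k = max(1, int(len(users) * top_share))  # number of "heavy deleters" considered
    top = sum(count for _, count in deleted_by_user.most_common(k))
    concentration = top / n_deleted if n_deleted else 0.0
    return ratio, concentration
```

A concentration close to 1 for a small `top_share` would be consistent with the power-law observation on the slide.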
Model decay due to evolving language & vocabulary
▪ Models & LLMs are trained on large volumes of text
▪ Yet: strong vocabulary shift, over-/underrepresentation of topics/vocabulary in particular time periods (e.g. Twitter COVID-19 discourse 2020 vs 2019)
▪ LLMs for online discourse analysis require frequent training and updates (and continuous access to data)
Source: Hombaiah et al., “Dynamic Language Models for continuously evolving Content”, SIGKDD2021
Redundant crawls of 1% Twitter stream via Firehose API
▪ 14 billion tweets collected between 04/2013 and 05/2023
▪ Largest continuous tweet archive for research purposes
▪ Legal, ethical and licensing constraints (Twitter ToS)
▪ Data sharing via:
o Sensitive data access: facilitating on-prem research on data (e.g. online/offline secure data centers) or contract-based sharing of sensitive data
o Public, non-sensitive data offers: creating non-sensitive derivatives from raw data to facilitate research
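Redundant crawling means that several collectors record the same stream, so an outage in one is covered by the others, at the cost of duplicates. A minimal sketch of the merge step, assuming each tweet is a dictionary with an `"id"` key (a hypothetical record format, not the archive's actual schema):

```python
def merge_crawls(*crawls):
    """Merge tweet records from redundant crawlers, keeping one copy per tweet ID."""
    merged = {}
    for crawl in crawls:
        for tweet in crawl:
            # The first copy seen wins; later duplicates are ignored.
            merged.setdefault(tweet["id"], tweet)
    return [merged[tweet_id] for tweet_id in sorted(merged)]
```

At archive scale this would be done per time window or with an external sort rather than in memory; the sketch only illustrates the idea.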
TweetsKB – a non-sensitive large-scale archive of societal discourse
▪ Subset of 3 billion prefiltered tweets (English, spam detection through pretrained classifier)
▪ Sharing of tweet metadata (timestamps, retweet counts, etc.), hashtags, user mentions and dedicated features that capture tweet semantics (no actual user IDs or full texts)
▪ Features include [CIKM2020, CIKM2022]:
o Disambiguated mentions of entities, linked to Wikipedia/DBpedia (“president”/“potus”/“trump” => dbp:DonaldTrump)
o Sentiment scores (positive/negative emotions)
o Geotags via pretrained DeepGeo model
o Science references/claims [CIKM2022]
https://data.gesis.org/tweetskb
Feature      Total            Unique         Ratio of tweets with >= 1 feature
Hashtags     1,161,839,471    68,832,205     0.19
Mentions     1,840,456,543    149,277,474    0.38
Entities     2,563,433,997    2,265,201      0.56
Sentiment    1,265,974,641    -              0.5
Dimitrov, D., Fafalios, P., Yu, R., Zhu, X., Zloch, M., Dietze, S., TweetsCOV19 – A KB of Semantically Annotated Tweets about the COVID-19 Pandemic, CIKM2020
Hafid, S., Schellhammer, S., Bringay, S., Todorov, K., Dietze, S., SciTweets - A Dataset and Annotation Framework for Detecting Scientific Online Discourse, CIKM2022
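The idea of a non-sensitive derivative (keep metadata and derived features, drop user IDs and full texts) can be sketched as below. The field names and the toy annotator are assumptions made for illustration; the real pipeline uses entity linking, sentiment and geolocation models that are not reproduced here.

```python
SENSITIVE_FIELDS = {"user_id", "text"}  # hypothetical field names

def toy_annotate(text):
    """Stand-in for real feature extractors: only pulls out hashtags."""
    tokens = text.split()
    return {"hashtags": [t[1:].lower() for t in tokens if t.startswith("#") and len(t) > 1]}

def to_non_sensitive(tweet, annotate=toy_annotate):
    """Return a shareable record: metadata plus derived features, no user ID or full text."""
    record = {k: v for k, v in tweet.items() if k not in SENSITIVE_FIELDS}
    record.update(annotate(tweet["text"]))
    return record
```

Features are computed from the text before it is discarded, so the shared record still supports aggregate analyses.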
TweetsKB – knowledge graph schema & data access
Data access via:
▪ SPARQL endpoint/REST API for demos
▪ Download of data dumps (Zenodo, SDN Datorium)
▪ So far approx. 30 K downloads
https://data.gesis.org/tweetskb
Dimitrov, D., Fafalios, P., Yu, R., Zhu, X., Zloch, M., Dietze, S., TweetsCOV19 – A KB of Semantically Annotated Tweets about the COVID-19 Pandemic, CIKM2020
TweetsKB as social science research corpus
Investigating vaccine hesitancy in DACH countries
(Figure: Twitter discourse on “Impfbereitschaft” / “vaccination hesitancy”; annotated event: Germany suspends vaccinations with AstraZeneca)
https://dd4p.gesis.org/
Boland, K. et al., Data for policy-making in times of crisis - a computational analysis of German online discourses about COVID-19 vaccinations, JMIR, under review
Case: Telegram
▪ Telegram channels: public, only admins can post (as opposed to private groups)
▪ Decentralised: no registry of channels available
▪ Continuous data collection of currently 400 K channels through snowball sampling (300 seed channels)
▪ Full message history collected for > 10 K channels, approx. 100 M messages so far
▪ Telegram cross-channel message passing dataset extracted to support research on information spreading, e.g. mis- and disinformation, hate speech, etc.
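Because there is no channel registry, snowball sampling discovers new channels by following links from channels already known. A minimal breadth-first sketch, assuming a hypothetical callback `get_linked_channels` that returns the channels referenced by a given channel (for example through forwarded messages); the callback and the `max_channels` cap are illustrative, not the project's actual crawler.

```python
from collections import deque

def snowball(seeds, get_linked_channels, max_channels=1000):
    """Breadth-first discovery of channels starting from a seed set."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(seen) < max_channels:
        channel = queue.popleft()
        for linked in get_linked_channels(channel):
            if linked not in seen and len(seen) < max_channels:
                seen.add(linked)
                queue.append(linked)
    return seen
```

The choice of seeds shapes which part of the network is reached, which is worth keeping in mind when interpreting the resulting sample.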
Understanding claims & misinformation on the Web: ClaimsKG
Motivation
▪ Claims are spread across various (unstructured) fact-checking sites
▪ Claims and truth ratings evolve over time
▪ Finding claims is hard: e.g. claims about / made by US Republican politicians across the Web?
Approach
▪ Continuous harvesting of claims & metadata from fact-checking sites (e.g. snopes.com, Politifact.com, etc.); currently approx. 75,000 claims since 2019
▪ Feature extraction & linking:
o Mentioned entities
o Joint topic classification
o Normalisation of ratings (true, false, mixture, other); coreference resolution of claims
o Exposing data through established vocabulary and W3C standards (e.g. SPARQL endpoint)
https://data.gesis.org/claimskg/
A. Tchechmedjiev, P. Fafalios, K. Boland, S. Dietze, B. Zapilko, K. Todorov, ClaimsKG – A Live Knowledge Graph of fact-checked Claims, ISWC2019
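Rating normalisation maps each site's own verdict labels onto the four shared classes (true, false, mixture, other). The lookup table below is purely illustrative and is an assumption, not the mapping used by ClaimsKG; only the four target classes come from the slide.

```python
# Illustrative source labels -> normalised classes (assumed, not the project's table)
RATING_MAP = {
    "true": "true",
    "false": "false",
    "pants on fire": "false",
    "mostly true": "mixture",
    "half true": "mixture",
    "mostly false": "mixture",
}

def normalise_rating(label):
    """Map a site-specific verdict label to one of: true, false, mixture, other."""
    key = " ".join(label.lower().replace("-", " ").split()).strip("!.")
    return RATING_MAP.get(key, "other")
```

Anything not covered by the table falls into "other", which keeps the mapping conservative when a site introduces a new label.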
Evolution of claims: frequency & topics
https://data.gesis.org/claimskg/
S. Gangopadhay et al., Investigating Characteristics, Biases and Evolution of Fact-Checked Claims on the Web, ACM Hypertext 2024 (under review)
Evolution of claims: topic biases of fact-check sources
https://data.gesis.org/claimskg/
S. Gangopadhay et al., Investigating Characteristics, Biases and Evolution of Fact-Checked Claims on the Web, ACM Hypertext 2024 (under review)
Stances towards claims / fake news in social media
Motivation
▪ Problem: detecting the stance of documents (e.g. social media posts) towards a given claim (unbalanced class distribution)
▪ Motivation: the stance of documents (in particular disagreement) is useful (a) as a signal for truthfulness (fake news detection) and (b) for document or source classification (e.g. users)
Approach
▪ Cascading binary classifiers: addressing individual issues (e.g. misclassification costs) per step
▪ Features, e.g. textual similarity (Word2Vec etc.), sentiments, LIWC
▪ Best-performing models: 1) SVM with class-wise penalty, 2) CNN, 3)
SVM with class-wise penalty
▪ Experiments on FNC-1 dataset (and FNC baselines)
Results
▪ Minor overall performance improvement
▪ Improvement on disagree class by 27%
(but still far from robust)
A., Fafalios, P., Ekbal, A., Zhu, X., Dietze, S., Exploiting stance hierarchies for cost-sensitive stance detection of Web documents, J Intell. Inf. Syst. 58(1), 1-19 (2022)
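The cascading approach can be pictured as a chain of binary decisions, each of which can be trained and cost-weighted separately. The sketch below uses the four FNC-1 labels (unrelated, discuss, agree, disagree); the order of the stages and the three classifier callbacks are assumptions for illustration, and in practice each callback would be a trained model such as a cost-sensitive SVM.

```python
def cascade_predict(doc, is_related, takes_stance, agrees):
    """Apply three binary decisions in sequence and return an FNC-1 style label."""
    if not is_related(doc):      # stage 1: related vs. unrelated
        return "unrelated"
    if not takes_stance(doc):    # stage 2: neutral discussion vs. taking a stance
        return "discuss"
    return "agree" if agrees(doc) else "disagree"  # stage 3: agree vs. disagree

# Toy keyword rules standing in for trained classifiers (hypothetical)
is_related = lambda d: "vaccine" in d
takes_stance = lambda d: "true" in d or "false" in d
agrees = lambda d: "true" in d

label = cascade_predict("the vaccine claim is false", is_related, takes_stance, agrees)
```

Splitting the problem this way lets the rare disagree class get its own decision boundary and misclassification cost instead of competing with the dominant unrelated class.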
Wrap-up: found data
Archival/collection
▪ Easy (assuming the gatekeepers’ goodwill), even over long time periods (TweetsKB: 10 years)
▪ Public APIs, screen-scraping, crawling
Analysis
▪ Heterogeneity and scale of data (e.g. tweets, query logs)
▪ Feature extraction (stances, topics, emotions, etc.) across the entire corpus is challenging
▪ Specific research questions usually require dedicated models (no one-size-fits-all approach)
Sharing
▪ Strict constraints (legal, ethical, licensing)
▪ Scalable sharing of sensitive data is still an unsolved problem
Designed behavioral web data to the rescue
▪ Goal: obtain sharable and easy-to-interpret behavioral web data through experimental lab studies & quasi-experiments
▪ Typically involves:
− Artificial settings (e.g. labs)
− Simulation of real-world online scenarios (e.g. web search)
− Usually less sensitive data
− Full consent of participants to data collection & sharing intentions
− Short time intervals
− Small-scale data (due to costly process)
Case: web search behavior (SAL = “Search As Learning”)
Research challenges at the intersection of AI/ML, HCI & cognitive psychology
▪ Detecting coherent search missions?
▪ Detecting learning throughout search? I.e. detecting “informational” search missions (as opposed to “transactional” or “navigational” missions)
▪ How competent is the user? Predict/understand the knowledge state of users based on in-session behavior/interactions
▪ How well does a user achieve their learning goal/information need? Predict knowledge gain throughout the search session
Hoppe, A., Holtz, P., Kammerer, Y., Yu, R., Dietze, S., Ewerth, R., Current Challenges for Studying Search as Learning Processes, 7th Workshop on
Learning & Education with Web Data (LILE2018), in conjunction with ACM Web Science 2018 (WebSci18), Amsterdam, NL, 27 May, 2018.
Data collection for understanding knowledge gain/state of users
Data collection - summary
▪ Crowdsourced collection of search session data
▪ 10 search topics (e.g. “Altitude sickness”, “Tornados”), incl. pre- and post-tests to assess user knowledge
▪ Approx. 1000 distinct crowd workers & 100 sessions per topic
▪ Tracking of user behavior through 76 features in 5 categories (session, query, SERP (search engine result page), browsing, mouse traces)
Gadiraju, U., Yu, R., Dietze, S., Holtz, P., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM CHIIR 2018.
Understanding knowledge gain/state of users during web search
Some results
▪ 70% of users exhibited a knowledge gain (KG)
▪ Negative relationship between KG of users and topic popularity (avg. accuracy of workers in knowledge tests) (R = -0.87)
▪ Amount of time users actively spent on web pages explains 7% of the variance in their KG
▪ Query complexity explains 25% of the variance in the KG of users
▪ Topic-dependent behavior: search behavior correlates more strongly with search topic than with KG/KS
Gadiraju, U., Yu, R., Dietze, S., Holtz, P., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM CHIIR 2018.
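To put the correlation and the variance figures above on one scale: for a linear relationship with a single predictor, the share of variance explained equals the squared correlation coefficient. The sketch below applies that identity; whether the reported percentages come from single-predictor linear fits is an assumption, as the slide does not say.

```python
import math

def variance_explained(r):
    """Share of variance explained by a single linear predictor with correlation r."""
    return r ** 2

def implied_abs_r(share):
    """Absolute correlation implied by a given share of explained variance."""
    return math.sqrt(share)

topic_popularity = variance_explained(-0.87)  # about 0.76
active_time_r = implied_abs_r(0.07)           # about 0.26
query_complexity_r = implied_abs_r(0.25)      # 0.5
```

Read this way, topic popularity is by far the strongest of the three relationships listed.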
Predicting knowledge gain/state during web search
▪ Same session data as Gadiraju et al., 2018
▪ Stratification of users into classes: user knowledge state (KS) and knowledge gain (KG) split into {low, moderate, high}, with low below mean - 0.5 SD, moderate within mean ± 0.5 SD, and high above mean + 0.5 SD
▪ Supervised multiclass classification (Naive Bayes, logistic regression, SVM, random forest, multilayer perceptron)
▪ KG prediction performance results (after 10-fold cross-validation)
▪ Considers in-session features (behavioral traces) only
Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM SIGIR 2018.
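The stratification rule (low / moderate / high around mean ± 0.5 SD) can be written down directly. This is a minimal sketch; it uses the sample standard deviation, which is an assumption since the slide does not say which estimator was used, and the function name is illustrative.

```python
from statistics import mean, stdev

def stratify(scores, width=0.5):
    """Label each score low / moderate / high relative to mean ± width * SD."""
    m, sd = mean(scores), stdev(scores)  # sample SD (assumed)
    lower, upper = m - width * sd, m + width * sd
    return [
        "low" if s < lower else "high" if s > upper else "moderate"
        for s in scores
    ]
```

The resulting labels serve as the target classes for the supervised multiclass classifiers listed above.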
Predicting knowledge gain/state during SAL: Features
(Figure: behavioral features)
Predicting knowledge gain/state during web search
▪ Feature importance (knowledge gain prediction task)
▪ Feature importance (knowledge state prediction task)
Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM SIGIR 2018.
Gaze data as additional source of behavioral data in SAL
▪ Eye gaze data (word-, sentence-, or HTML structure-level) as additional source of behavioral data
▪ Various studies in SAL context and beyond to understand topic familiarity, knowledge & competence or comprehension issues
▪ Usually small study sizes (e.g. 25 < N < 150)
▪ Costly but highly informative features
Davari, M., Yu, R., Dietze, S., Understanding the Influence of Topic Familiarity on Search Behavior in Digital Libraries, EARS 2019 – International Workshop on ExplainAble Recommendation and Search, @ SIGIR2019, 2019.
Otto, C., Yu, R., Pardi, G., von Hoyer, J., Rokicki, M., Hoppe, A., Holtz, P., Kammerer, Y., Dietze, S., Ewerth, R., Predicting Knowledge Gain during Web Search based on Multimedia Resource Consumption, 22nd International Conference on Artificial Intelligence in Education (AIED2021), 2021
Facilitating SAL research through public research data
https://data.uni-hannover.de/dataset/sal-dataset
Otto, C., Rokicki, M., Pardi, G., Gritz, W., Hienert, D., Yu, R., Hoyer, J., Hoppe, A., Dietze, S., Holtz, P., Kammerer, Y., Ewerth, R., SaL-Lightning Dataset: Search and Eye Gaze Behavior, Resource Interactions and Knowledge Gain during Web Search, ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR2022).
Case: crowd worker behavior in microtask crowdsourcing
▪ Context: online crowdsourcing tasks are widely used to collect data
▪ Research question: can we classify different worker types (and detect competent workers) from behavioral traces alone (mouse movements, scrolling, keystrokes, etc.)?
▪ Various studies in experimental conditions capturing a wide range of features in various tasks
▪ Low-level behavioral features are highly informative when predicting worker competence and output quality
(Figure: example worker types “Fast Deceiver” and “Competent Worker”)
Gadiraju, U., Kawase, R., Dietze, S., Demartini, G., Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online Surveys. ACM CHI2015
Gadiraju, U., Demartini, G., Kawase, R., Dietze, S., Crowd Anatomy Beyond the Good and Bad: Behavioral Traces for Crowd Worker Modeling and Pre-selection, Computer Supported Cooperative Work 28(5): 815-841 (2019), Springer, 2019.
Wrap-up: found vs designed behavioral data
Found data
▪ Collection: possible as long as gatekeepers allow crawling / scraping
▪ Analysis: large & heterogeneous data; long time intervals; no “one-size-fits-all” methods
▪ Sharing: sensitive information; ethical, legal, licensing constraints
Designed data
▪ Collection: costly experimental data collection
▪ Analysis: homogeneous, small-scale data; short time intervals; limited use cases
▪ Sharing: full consent of participants; little sensitive information due to artificial tasks
Key take-aways
▪ Behavioral web data: crucial ingredient for a wide range of research across various disciplines
▪ Found data: crucial to archive to ensure long-term access; sharing is hard due to sensitive information.
▪ Designed data: collection is costly; limited scale and scope of data.
▪ Access to behavioral web data remains a challenge => ongoing & future work @ KTS/GESIS on
− infrastructures for collecting experimental data (e.g. in web search)
− infrastructures for data access (e.g. for tweet archives)
− non-sensitive data offers to enable reuse of sensitive found data (e.g. TweetsKB)
@stefandietze
https://stefandietze.net
http://gesis.org/en/kts

Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside

  • 1.
    Collecting and TemporalAnalysis of Behavioral Web Data - Tales from the Inside TempWeb2024, 13 May 2024 Stefan Dietze GESIS, HHU & HeiCAD Düsseldorf
  • 2.
    What is behavioralweb data? Source: Domo via PCMag
  • 3.
    What is behavioralweb data? ▪ Social web activity streams (posts, shares, likes, follows etc) ▪ Web search behaviour & SERP (Search Engine Result Pages) interactions ▪ Browsing and navigation behaviour ▪ Low-level behavioral traces (scrolling, mouse movements, gaze behavior etc) ▪ Hard to separate from actual Web content/pages ▪ But: closer to users & their personal (potentially sensitive) information
  • 4.
    Why is itimportant? ▪ Reflects attitudes, leanings, cognitive states, biases ▪ Without understanding behavior, we cannot understand content / data it produces ▪ Majority of algorithms and models rely on behavioral data (e.g. clickthrough data for ranking algorithms)… ▪ …or are substantially impacted by user behavior (e.g. LLMs trained on user-generated content that in turn is driven by user interactions) ▪ Central to various research fields in CS concerned with information behavior: interactive information retrieval, HCI, user modeling, Web mining, etc
  • 5.
    Why is itimportant? ▪ Spawned entirely new research areas like Computational Social Science (CSS)
  • 6.
    Overview ▪ Challenges ofbehavioral web data ▪ Case studies (collecting, sharing, analysis: data & methods) o„Found“ behavioral web data o„Designed“ behavioral web data ▪ Take-aways & outlook
  • 7.
    Challenges: dependencies on3rd party gatekeepers Behavioral data is usually tied to specific platforms, not distributed as the WWW
  • 8.
    Challenges: volatility &decay of data • Data is not persistent • Example: deletion ratio of tweets between 25-29 % • Differs between different samples
  • 9.
  • 10.
    Challenges: behavioral datais sensitive Example: AOL release of 20 M search queries (2006)
  • 11.
    Challenges: legal restrictionsand ethical concerns ▪ Behavioral web data tends to involve sensitive information ▪ Ethical concerns, e.g., when information is taken out of context ▪ Various national and international laws (GDPR etc) ▪ Licensing / legal aspects: Twitter terms of service, copyright, etc. ▪ At the same time: right to archive / research wired into various national legislations ▪ Different constraints for (a) archiving and (b) sharing / using data as well as for different uses & users (e.g. archival institutions) ▪ Individual risk assessment per use case: What (kind of data)?, For what purpose? By whom?
  • 12.
    Overview ▪ Challenges ofbehavioral web data ▪ Case studies (collecting, sharing, analysis: data and methods) o„Found“ behavioral web data o„Designed“ behavioral web data ▪ Take-aways & outlook
  • 13.
    15 Range of researchconcerned with IR & CSS: ▪ Insights, e.g.: − Understanding information interaction (e.g. during search) − Spreading of claims and misinformation − Effect of biased news/claims on public opinion ▪ Computational Methods, e.g.: − Crawling, harvesting, scraping of data − Information retrieval & ranking − Extraction of structured knowledge (entities, sentiments, stances, claims, etc) − Classification of search/navigation behavior or users Found & designed web data for investigating (mis)information behavior http://gesis.org/en/kts
  • 14.
    Found behavioral webdata ▪ Data that can be harvested via open APIs or scraped from the public web over long time periods and captures real-world online interactions “found” in the wild ▪ Examples: social web posts/interactions, Twitter/x data (specifically before API shutdown) ▪ Tends to include data that has been shared voluntarily by online users, e.g. Twitter users ▪ But: users usually did not provide explicit consent for secondary use of their data
  • 15.
    Case study: Twitter/X Motivation Archivalperspective: ▪ Ensure long-term archival of volatile information from Twitter ▪ Independence from third-party data access / APIs Research perspective ▪ Training and evaluating machine learning models (e.g., NER, classification) ▪ Large-scale analyses (e.g., language use, trends) ▪ Facilitate interdisciplinary research on societal online discourse (e.g. political science, communication science, psychology, sociology) → Goal: capture a representative sample of all Twitter data
  • 16.
    18 Why real-time collection& preservation of Twitter/X data? ▪ Approx. 28% of tweets deleted over time ▪ Power law distribution: vast majority of tweets is deleted by small number of users ▪ Prevalent biases in deleted/non-deleted data: anti- science, conservative and hard-line views more frequent in deleted tweets Data decay
  • 17.
    19 Why real-time collection& preservation of Twitter/X data? Model decay due to evolving language & vocabulary ▪ Models & LLMs trained on large volumes of text ▪ Yet: strong vocabulary shift, over- /underrepresentation of topics/vocabulary in particular time periods (e.g. Twitter COVID19- discourse 2020 vs 2019) ▪ LLMs for online discourse analysis require frequent training and updates (and continuous access to data) Source: Hombaiah et al., “Dynamic Language Models for continuously evolving Content”, SIGKDD2021
  • 18.
    Redundant crawls of1% Twitter stream via Firehose API 20 ▪ 14 billion tweets collected between 04/2013 – 05/2023 ▪ Largest continuous tweet archive for research purposes ▪ Legal, ethical and licensing constraints (Twitter ToC) ▪ Data sharing via: o Sensitive data access: facilitating on-prem research on data (e.g. online/offline secure data centers) or contract-based sharing of sensitive data o Public, non-sensitive data offers: creating non-sensitive derivatives from raw data to facilitate research
  • 19.
    22 TweetsKB – anon-sensitive large-scale archive of societal discourse ▪ Subset of 3 billion prefiltered tweets (English, spam detection through pretrained classifier) ▪ Sharing of tweet metadata (time stamps, retweet counts etc), hash tags, user mentions and dedicated features that capture tweet semantics (no actual user IDs and full texts) ▪ Features include [CIKM2020, CIKM2022]: o Disambiguated mentions of entities, linked to Wikipedia/Dbpedia (“president”/“potus”/”trump” => dbp:DonaldTrump) o Sentiment scores (positive/negative emotions) o Geotags via pretrained DeepGeo model o Science references/claims [CIKM2022] https://data.gesis.org/tweetskb Feature Total Unique % with >= 1 feature Hashtags: 1,161,839,471 68,832,205 0.19 Mentions: 1,840,456,543 149,277,474 0.38 Entities: 2,563,433,997 2,265,201 0.56 Sentiment: 1,265,974,641 - 0.5 Dimitrov, D., Fafalios, P., Yu, R., Zhu, X., Zloch, M., Dietze, S., TweetsCOV19 – A KB of Semantically Annotated Tweets about the COVID-19 Pandemic, CIKM2020 Hafid, S., Schellhammer, S., Bringay, S., Todorov, K., Dietze, S., SciTweets - A Dataset and Annotation Framework for Detecting Scientific Online Discourse, CIKM2022
  • 20.
    24 https://data.gesis.org/tweetskb TweetsKB – knowledgegraph schema & data access Dimitrov, D., Fafalios, P., Yu, R., Zhu, X., Zloch, M., Dietze, S., TweetsCOV19 – A KB of Semantically Annotated Tweets about the COVID-19 Pandemic, CIKM2020 Data access via: ▪ SPARQL endpoint/REST API for demos ▪ Download of data dumps (Zenodo, SDN Datorium) ▪ So far approx. 30 K downloads
  • 21.
    25 Germany suspends vaccinations withAstra Zeneca Twitter discourse zu “Impfbereitschaft” / „Vaccination hesitancy“ TweetsKB as social science research corpus Investigating vaccine hesitancy in DACH countries https://dd4p.gesis.org/ Boland, K. et al., Data for policy-making in times of crisis - a computational analysis of German online discourses about COVID-19 vaccinations, JMIR, under review Germany suspends vaccinations with Astra Zeneca
  • 22.
    Case: Telegram 26 ▪ Telegramchannels: public, only admin can post (as opposed to private groups) ▪ Decentralised: no registry of channels available ▪ Continuous data collection of currently 400 K channels through snowball sampling (300 seed channels) ▪ Full message history collected for > 10 K channels, approx. 100 M messages so far ▪ Telegram cross-channel message passing dataset extracted to support information spreading research, i.e., mis- and disinformation, hate speech etc
  • 23.
    28 Understanding claims &misinformation on the Web: ClaimsKG Motivation ▪ Claims spread across various (unstructured) fact-checking sites ▪ Claims and truth ratings evolve over time ▪ Finding claims is hard: e.g. claims about / made by US republican politicians across the Web? Approach ▪ Continuous harvesting claims & metadata from fact- checking sites (e.g. snopes.com, Politifact.com etc); currently approx. 75.000 claims since 2019 ▪ Feature extraction & linking: o Mentioned entities o Joint topic classification o Normalisation of ratings (true, false, mixture, other); coreference resolution of claims o Exposing data through established vocabulary and W3C standards (e.g. SPARQL endpoint) https://data.gesis.org/claimskg/ A. Tchechmedjiev, P. Fafalios, K. Boland, S. Dietze, B. Zapilko, K. Todorov, ClaimsKG – A Live Knowledge Graph of fact-checked Claims, ISWC2019
  • 24.
    30 Evolution of claims:frequency & topics https://data.gesis.org/claimskg/ S. Gangopadhay et al., Investigating Characteristics, Biases and Evolution of Fact-Checked Claims on theWeb, ACM Hypertext 2024 (under review)
  • 25.
    31 Evolution of claims:topic biases of fact-check sources https://data.gesis.org/claimskg/ S. Gangopadhay et al., Investigating Characteristics, Biases and Evolution of Fact-Checked Claims on theWeb, ACM Hypertext 2024 (under review)
  • 26.
    32 Stances towards claims/ fake news in social media Motivation ▪ Problem: detecting stance of documents (e.g. social media posts) towards a given claim (unbalanced class distribution) ▪ Motivation: stance of documents (in particular disagreement) useful (a) as signal for truthfulness (fake news detection) and (b) Document or Source classification (e.g. users) Approach ▪ Cascading binary classifiers: addressing individual issues (e.g. misclassification costs) per step ▪ Features, e.g. textual similarity (Word2Vec etc), sentiments, LIWC, etc. ▪ Best-performing models: 1) SVM with class-wise penalty, 2) CNN, 3) SVM with class-wise penalty ▪ Experiments on FNC-1 dataset (and FNC baselines) Results ▪ Minor overall performance improvement ▪ Improvement on disagree class by 27% (but still far from robust) A., Fafalios, P., Ekbal, A., Zhu, X., Dietze, S., Exploiting stance hierarchies for cost-sensitive stance detection of Web documents, J Intell. Inf. Syst. 58(1), 1-19 (2022)
  • 27.
    Wrap-up: found data 33 Archival/collection ▪Easy (assuming gatekeeper‘s goodwill), even over long time periods (TweetsKB: 10 years) ▪ Public APIs, screen-scraping, crawling Analysis ▪ Heterogeneity and scale of data (example Tweets, query logs) ▪ Feature extraction (stances, topics, emotions, etc) across entire corpus challenging ▪ Specific research questions usually require dedicated models (no one-size-fits-all approach) Sharing ▪ Strict constraints (legal, ethical, licensing) ▪ Scalable sharing of sensitive data still unsolved problem
  • 28.
    Designed behavioral webdata to the rescue 34 ▪ Goal: obtain sharable and easy to interpret behavioral web data through experimental lab studies & quasi- experiments ▪ Typically involves: − Artifical settings (e.g. labs), − Simulation of real-world online scenarios (e.g. web search) − Usually less sensitive − Full consent of participants about data collection & sharing intentions − Short time intervals − Small-scale data (due to costly process)
  • 29.
    Case: web searchbehavior (SAL = „Search As Learning“) 35 Research challenges at the intersection of AI/ML, HCI & cognitive psychology ▪ Detecting coherent search missions? ▪ Detecting learning throughout search? detecting “informational” search missions (as opposed to “transactional” or “navigational” missions) ▪ How competent is the user? – Predict/understand knowledge state of users based on in-session behavior/interactions ▪ How well does a user achieve his/her learning goal/information need? - Predict knowledge gain throughout search session Hoppe, A., Holtz, P., Kammerer, Y., Yu, R., Dietze, S., Ewerth, R., Current Challenges for Studying Search as Learning Processes, 7th Workshop on Learning & Education with Web Data (LILE2018), in conjunction with ACM Web Science 2018 (WebSci18), Amsterdam, NL, 27 May, 2018.
  • 30.
    Data collection forunderstanding knowledge gain/state of users Gadiraju, U., Yu, R., Dietze, S., Holtz, P.,. Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM CHIIR 2018. Data collection - summary ▪ Crowdsourced collection of search session data ▪ 10 search topics (e.g. “Altitude sickness”, “Tornados”), incl. pre- and post-tests to assess user knowledge ▪ Approx. 1000 distinct crowd workers & 100 sessions per topic ▪ Tracking of user behavior through 76 features in 5 categories (session, query, SERP – search engine result page, browsing, mouse traces)
  • 31.
    Understanding knowledge gain/stateof users during web search 37 Some results ▪ 70% of users exhibited a knowledge gain (KG) ▪ Negative relationship between KG of users and topic popularity (avg. accuracy of workers in knowledge tests) (R= -.87) ▪ Amount of time users actively spent on web pages describes 7% of the variance in their KG ▪ Query complexity explains 25% of the variance in the KG of users ▪ Topic-dependent behavior: search behavior correlates stronger with search topic than with KG/KS Gadiraju, U., Yu, R., Dietze, S., Holtz, P.,. Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM CHIIR 2018.
  • 32.
    Predicting knowledge gain/state during web search
    ▪ Same session data as Gadiraju et al., 2018
    ▪ Stratification of users by knowledge state (KS) and knowledge gain (KG) into classes {low, moderate, high}, using low < (mean ± 0.5 SD) < high
    ▪ Supervised multiclass classification (Naive Bayes, logistic regression, SVM, random forest, multilayer perceptron)
    ▪ KG prediction performance results (after 10-fold cross-validation)
    ▪ Considers in-session features (behavioural traces) only
    Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM SIGIR 2018.
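The stratification rule above (low < mean ± 0.5 SD < high) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses the sample standard deviation and assigns values lying exactly on a boundary to the moderate class, neither of which the slide specifies. The function name and the sample values are hypothetical.

```python
from statistics import mean, stdev


def stratify(values: list[float], width: float = 0.5) -> list[str]:
    """Label each value low / moderate / high relative to mean ± width * SD."""
    if len(values) < 2:
        raise ValueError("need at least two values to estimate the SD")
    center = mean(values)
    spread = stdev(values)  # sample SD; population SD would be an equally valid assumption
    lower = center - width * spread
    upper = center + width * spread
    labels = []
    for value in values:
        if value < lower:
            labels.append("low")
        elif value > upper:
            labels.append("high")
        else:
            labels.append("moderate")  # includes values exactly on a boundary
    return labels


if __name__ == "__main__":
    # Made-up knowledge gain scores, not taken from the study.
    gains = [0.0, 0.1, 0.5, 0.5, 0.9, 1.0]
    print(list(zip(gains, stratify(gains))))
```

The resulting labels would then serve as targets for the multiclass classifiers listed on the slide, with the in-session behavioral features as inputs.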
  • 33.
    Predicting knowledge gain/state during SAL: Features
    Behavioral features
    Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM SIGIR 2018.
  • 34.
    Predicting knowledge gain/state during web search
    ▪ Feature importance (knowledge gain prediction task)
    Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM SIGIR 2018.
  • 35.
    Predicting knowledge gain/state during web search
    ▪ Feature importance (knowledge state prediction task)
    Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM SIGIR 2018.
  • 36.
    Gaze data as additional source of behavioral data in SAL
    ▪ Eye gaze data (at word, sentence, or HTML-structure level) as an additional source of behavioral data
    ▪ Various studies in the SAL context and beyond to understand topic familiarity, knowledge & competence, or comprehension issues
    ▪ Usually small study sizes (e.g. 25 < N < 150)
    ▪ Costly but highly informative features
    Davari, M., Yu, R., Dietze, S., Understanding the Influence of Topic Familiarity on Search Behavior in Digital Libraries, EARS 2019 - International Workshop on ExplainAble Recommendation and Search, @ SIGIR2019, 2019.
    Otto, C., Yu, R., Pardi, G., von Hoyer, J., Rokicki, M., Hoppe, A., Holtz, P., Kammerer, Y., Dietze, S., Ewerth, R., Predicting Knowledge Gain during Web Search based on Multimedia Resource Consumption, 22nd International Conference on Artificial Intelligence in Education (AIED2021), 2021.
  • 37.
    Facilitating SAL research through public research data
    https://data.uni-hannover.de/dataset/sal-dataset
    Otto, C., Rokicki, M., Pardi, G., Gritz, W., Hienert, D., Yu, R., Hoyer, J., Hoppe, A., Dietze, S., Holtz, P., Kammerer, Y., Ewerth, R., SaL-Lightning Dataset: Search and Eye Gaze Behavior, Resource Interactions and Knowledge Gain during Web Search, ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR2022).
  • 38.
    Case: crowd worker behavior in microtask crowdsourcing
    [Figure labels: „Fast Deceiver“, „Competent Worker“]
    ▪ Context: online crowdsourcing tasks are widely used to collect data
    ▪ Research question: can we classify different worker types (and detect competent workers) from behavioral traces alone (mouse movements, scrolling, keystrokes etc)?
    ▪ Various studies under experimental conditions, capturing a wide range of features across various tasks
    ▪ Low-level behavioural features are highly informative when predicting worker competence and output quality
    Gadiraju, U., Kawase, R., Dietze, S., Demartini, G., Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online Surveys. ACM CHI2015.
    Gadiraju, U., Demartini, G., Kawase, R., Dietze, S., Crowd Anatomy Beyond the Good and Bad: Behavioral Traces for Crowd Worker Modeling and Pre-selection, Computer Supported Cooperative Work 28(5): 815-841, Springer, 2019.
  • 39.
    Wrap-up: found vs designed behavioral data
    FOUND DATA
    ▪ Collection: as long as gatekeepers allow crawling / scraping
    ▪ Analysis: large & heterogeneous data; long time intervals; no „one-size-fits-all“ methods
    ▪ Sharing: sensitive information; ethical, legal, licensing constraints
    DESIGNED DATA
    ▪ Collection: costly experimental data collection
    ▪ Analysis: homogeneous, small-scale data; short time intervals; limited use cases
    ▪ Sharing: full consent of participants; little sensitive information due to artificial tasks
  • 40.
    Key take-aways
    ▪ Behavioral Web Data: crucial ingredient for a wide range of research across various disciplines
    ▪ Found Data: crucial to archive to ensure long-term access; sharing is hard due to sensitive information.
    ▪ Designed Data: collection is costly; limited scale and scope of data.
    ▪ Access to behavioral web data remains a challenge => ongoing & future work @ KTS/GESIS on
    − infrastructures for collecting experimental data (e.g. in web search)
    − infrastructures for data access (e.g. for tweet archives)
    − non-sensitive data offers to enable reuse of sensitive found data (e.g. TweetsKB)