Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
1. Collecting and Temporal Analysis of Behavioral Web Data - Tales from the Inside
TempWeb2024, 13 May 2024
Stefan Dietze
GESIS, HHU & HeiCAD Düsseldorf
3. What is behavioral web data?
▪ Social web activity streams (posts, shares, likes, follows etc)
▪ Web search behaviour & SERP (Search Engine Result Pages) interactions
▪ Browsing and navigation behaviour
▪ Low-level behavioral traces (scrolling, mouse movements, gaze behavior etc)
▪ Hard to separate from actual Web content/pages
▪ But: closer to users & their personal (potentially sensitive) information
4. Why is it important?
▪ Reflects attitudes, leanings, cognitive states, biases
▪ Without understanding behavior, we cannot understand content / data it produces
▪ Majority of algorithms and models rely on behavioral data (e.g. clickthrough data for
ranking algorithms)…
▪ …or are substantially impacted by user behavior (e.g. LLMs trained on user-generated
content that in turn is driven by user interactions)
▪ Central to various research fields in CS concerned with information behavior:
interactive information retrieval, HCI, user modeling, Web mining, etc
5. Why is it important?
▪ Spawned entirely new research
areas like Computational Social
Science (CSS)
6. Overview
▪ Challenges of behavioral web data
▪ Case studies (collecting, sharing, analysis: data & methods)
o "Found" behavioral web data
o "Designed" behavioral web data
▪ Take-aways & outlook
7. Challenges: dependencies on 3rd party gatekeepers
Behavioral data is usually tied to specific
platforms rather than distributed like the WWW
8. Challenges: volatility & decay of data
• Data is not persistent
• Example: the deletion ratio of tweets is between 25% and 29%
• The ratio differs between samples
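A deletion ratio like the one above can be estimated by re-checking which of the originally collected tweets are still retrievable. The following is a minimal sketch under that assumption; the function name and the sample IDs are hypothetical and not taken from the talk.

```python
def deletion_ratio(collected_ids, available_ids):
    """Share of originally collected tweets that can no longer be retrieved."""
    collected = set(collected_ids)
    if not collected:
        raise ValueError("collected_ids must not be empty")
    deleted = collected - set(available_ids)
    return len(deleted) / len(collected)


# Hypothetical sample: 8 tweets were collected, 6 are still retrievable.
collected = [101, 102, 103, 104, 105, 106, 107, 108]
still_available = [101, 102, 104, 105, 107, 108]

ratio = deletion_ratio(collected, still_available)
print(f"deletion ratio: {ratio:.1%}")
```

Because the ratio depends on which tweets were sampled and when they are re-checked, running this on different samples would be expected to give different values.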
11. Challenges: legal restrictions and ethical concerns
▪ Behavioral web data tends to involve sensitive information
▪ Ethical concerns, e.g., when information is taken out of context
▪ Various national and international laws (GDPR etc)
▪ Licensing / legal aspects: Twitter terms of service, copyright, etc.
▪ At the same time: the right to archive / research is written into various national laws
▪ Different constraints for (a) archiving and (b) sharing / using data as well as for
different uses & users (e.g. archival institutions)
▪ Individual risk assessment per use case: What (kind of data)? For what purpose? By whom?
12. Overview
▪ Challenges of behavioral web data
▪ Case studies (collecting, sharing, analysis: data and methods)
o "Found" behavioral web data
o "Designed" behavioral web data
▪ Take-aways & outlook
13. Found & designed web data for investigating (mis)information behavior
Range of research concerned with IR & CSS:
▪ Insights, e.g.:
− Understanding information interaction (e.g. during search)
− Spreading of claims and misinformation
− Effect of biased news/claims on public opinion
▪ Computational methods, e.g.:
− Crawling, harvesting, scraping of data
− Information retrieval & ranking
− Extraction of structured knowledge (entities, sentiments, stances, claims, etc)
− Classification of search/navigation behavior or users
http://gesis.org/en/kts
14. Found behavioral web data
▪ Data that can be harvested via open APIs or
scraped from the public web over long time
periods and captures real-world online
interactions “found” in the wild
▪ Examples: social web posts/interactions, Twitter/X
data (specifically before the API shutdown)
▪ Tends to include data that has been shared
voluntarily by online users, e.g. Twitter users
▪ But: users usually did not provide explicit consent
for secondary use of their data
15. Case study: Twitter/X
Motivation
Archival perspective:
▪ Ensure long-term archival of volatile information from Twitter
▪ Independence from third-party data access / APIs
Research perspective:
▪ Training and evaluating machine learning models (e.g., NER, classification)
▪ Large-scale analyses (e.g., language use, trends)
▪ Facilitate interdisciplinary research on societal online discourse
(e.g. political science, communication science, psychology, sociology)
→ Goal: capture a representative sample of all Twitter data
16. Why real-time collection & preservation of Twitter/X data?
Data decay
▪ Approx. 28% of tweets are deleted over time
▪ Power law distribution: the vast majority of deletions come from a small number of users
▪ Prevalent biases in deleted/non-deleted data: anti-science, conservative and hard-line views are more frequent in deleted tweets
17. Why real-time collection & preservation of Twitter/X data?
Model decay due to evolving language & vocabulary
▪ Models & LLMs are trained on large volumes of text
▪ Yet: strong vocabulary shift, over-/underrepresentation of topics/vocabulary in particular time periods (e.g. Twitter COVID-19 discourse 2020 vs 2019)
▪ LLMs for online discourse analysis require frequent training and updates (and continuous access to data)
Source: Hombaiah et al., “Dynamic Language Models for continuously evolving Content”, SIGKDD2021
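One simple way to make vocabulary shift measurable is the out-of-vocabulary rate: the fraction of tokens in a later period that never occurred in an earlier one. The sketch below illustrates that idea only; it is not the method of the cited paper, and the tokenizer, function names and example tweets are all assumptions.

```python
import re

# Deliberately simple tokenizer (an assumption): lowercase words, digits, # and @.
TOKEN = re.compile(r"[a-z0-9#@']+")


def tokenize(text):
    return TOKEN.findall(text.lower())


def oov_rate(reference_texts, target_texts):
    """Fraction of target tokens that do not occur in the reference vocabulary."""
    vocab = {tok for text in reference_texts for tok in tokenize(text)}
    target_tokens = [tok for text in target_texts for tok in tokenize(text)]
    if not target_tokens:
        return 0.0
    unseen = sum(1 for tok in target_tokens if tok not in vocab)
    return unseen / len(target_tokens)


# Hypothetical mini-corpora standing in for two time periods.
tweets_2019 = ["new phone arrived today", "the match today was great"]
tweets_2020 = ["lockdown extended today", "the new covid19 lockdown rules"]

rate = oov_rate(tweets_2019, tweets_2020)
print(f"out-of-vocabulary rate 2020 vs 2019: {rate:.3f}")
```

A model whose vocabulary was fixed on the earlier period would see a correspondingly large share of unknown tokens in the later one, which is one reason continuous access to fresh data matters.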
18. Redundant crawls of 1% Twitter stream via Firehose API
▪ 14 billion tweets collected between 04/2013 – 05/2023
▪ Largest continuous tweet archive for research purposes
▪ Legal, ethical and licensing constraints (Twitter ToS)
▪ Data sharing via:
o Sensitive data access: facilitating on-prem research on data (e.g. online/offline
secure data centers) or contract-based sharing of sensitive data
o Public, non-sensitive data offers: creating non-sensitive derivatives from raw data to
facilitate research
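Redundant crawls only help if their outputs are merged without duplicates. A minimal sketch of such a merge follows, assuming (this is not stated in the talk) that each crawler may miss tweets independently and that every record carries a unique `id` field; all names and records are hypothetical.

```python
def merge_crawls(*crawls):
    """Merge several crawls of the same stream, keeping one record per tweet ID."""
    merged = {}
    for crawl in crawls:
        for tweet in crawl:
            # Keep the first copy seen; later duplicates are ignored.
            merged.setdefault(tweet["id"], tweet)
    return [merged[tweet_id] for tweet_id in sorted(merged)]


# Hypothetical output of two crawlers with different gaps.
crawler_a = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}, {"id": 4, "text": "d"}]
crawler_b = [{"id": 2, "text": "b"}, {"id": 3, "text": "c"}, {"id": 4, "text": "d"}]

archive = merge_crawls(crawler_a, crawler_b)
print([tweet["id"] for tweet in archive])
```

The merged archive contains every tweet seen by at least one crawler, so a gap in one crawl is covered as long as another crawler was running at that time.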
19. TweetsKB – a non-sensitive large-scale archive of societal discourse
▪ Subset of 3 billion prefiltered tweets (English, spam detection through pretrained classifier)
▪ Sharing of tweet metadata (time stamps, retweet counts etc), hashtags, user mentions and dedicated features that capture tweet semantics (no actual user IDs or full texts)
▪ Features include [CIKM2020, CIKM2022]:
o Disambiguated mentions of entities, linked to Wikipedia/DBpedia ("president"/"potus"/"trump" => dbp:DonaldTrump)
o Sentiment scores (positive/negative emotions)
o Geotags via pretrained DeepGeo model
o Science references/claims [CIKM2022]
https://data.gesis.org/tweetskb

Feature     Total           Unique        % with >= 1 feature
Hashtags    1,161,839,471   68,832,205    0.19
Mentions    1,840,456,543   149,277,474   0.38
Entities    2,563,433,997   2,265,201     0.56
Sentiment   1,265,974,641   -             0.5
Dimitrov, D., Fafalios, P., Yu, R., Zhu, X., Zloch, M., Dietze, S., TweetsCOV19 – A KB of Semantically Annotated Tweets about the COVID-19 Pandemic, CIKM2020
Hafid, S., Schellhammer, S., Bringay, S., Todorov, K., Dietze, S., SciTweets - A Dataset and Annotation Framework for Detecting Scientific Online Discourse, CIKM2022
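The idea of a non-sensitive derivative can be sketched as a function that keeps metadata, hashtags and mentions but drops the user ID and the full text. This is an illustration only: the field names, regular expressions and sample tweet are assumptions and do not reproduce the actual TweetsKB pipeline, which also adds entity, sentiment and geo features.

```python
import re

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")


def to_derivative(raw):
    """Return a record without user ID and full text (hypothetical field names)."""
    text = raw["text"]
    return {
        "created_at": raw["created_at"],
        "retweet_count": raw["retweet_count"],
        "hashtags": [tag.lower() for tag in HASHTAG.findall(text)],
        "mentions": [name.lower() for name in MENTION.findall(text)],
    }


# Hypothetical raw tweet.
raw_tweet = {
    "user_id": "12345",
    "created_at": "2021-03-15T10:00:00Z",
    "retweet_count": 7,
    "text": "Vaccination update from @WHO #COVID19 #vaccine",
}

record = to_derivative(raw_tweet)
print(record)
```

Whether a derivative of this kind is non-sensitive enough to share still depends on the legal and ethical assessment of the concrete use case.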
20. TweetsKB – knowledge graph schema & data access
https://data.gesis.org/tweetskb
Dimitrov, D., Fafalios, P., Yu, R., Zhu, X., Zloch, M., Dietze, S., TweetsCOV19 – A KB of Semantically Annotated Tweets about the COVID-19 Pandemic, CIKM2020
Data access via:
▪ SPARQL endpoint/REST API for demos
▪ Download of data dumps (Zenodo, SDN Datorium)
▪ So far approx. 30 K downloads
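Access through a SPARQL endpoint typically follows the W3C SPARQL 1.1 Protocol: the query is sent as a `query` parameter and the result format is negotiated through the `Accept` header. The sketch below only builds such a request with the Python standard library. The endpoint URL is a placeholder and not the real TweetsKB endpoint, and the query is schema-agnostic because the knowledge graph schema is not reproduced in this transcript.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder URL (an assumption); replace with the endpoint documented by the data provider.
ENDPOINT = "https://example.org/sparql"

# Schema-agnostic query: fetch any five triples.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_sparql_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` and parsing the body with `json.load` would return the variable bindings, provided the endpoint is reachable and supports JSON results.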
21. TweetsKB as social science research corpus
Investigating vaccine hesitancy in DACH countries
[Figure: Twitter discourse on "Impfbereitschaft" / "Vaccination hesitancy"; annotated event: Germany suspends vaccinations with AstraZeneca]
https://dd4p.gesis.org/
Boland, K. et al., Data for policy-making in times of crisis - a computational analysis of German online discourses about COVID-19 vaccinations, JMIR, under review
22. Case: Telegram
▪ Telegram channels: public, only admins can post (as opposed to
private groups)
▪ Decentralised: no registry of channels available
▪ Continuous data collection of currently 400 K channels through
snowball sampling (300 seed channels)
▪ Full message history collected for > 10 K channels, approx. 100 M
messages so far
▪ Telegram cross-channel message passing dataset extracted to
support information spreading research, i.e., mis- and
disinformation, hate speech etc
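Since there is no registry of channels, snowball sampling discovers new channels by following links from channels that are already known. The sketch below is a breadth-first version of that idea. How links between channels are obtained (for example from forwarded messages) is an assumption here, and the function names, link table and channel names are hypothetical.

```python
from collections import deque


def snowball_sample(seeds, get_linked_channels, max_channels):
    """Breadth-first discovery of channels starting from a list of seed channels."""
    discovered = set(seeds)
    queue = deque(seeds)
    while queue and len(discovered) < max_channels:
        channel = queue.popleft()
        for linked in get_linked_channels(channel):
            if linked not in discovered:
                discovered.add(linked)
                queue.append(linked)
                if len(discovered) >= max_channels:
                    break
    return discovered


# Hypothetical link table standing in for links extracted from channel messages.
LINKS = {
    "seed_a": ["chan_1", "chan_2"],
    "seed_b": ["chan_2", "chan_3"],
    "chan_1": ["chan_4"],
    "chan_3": ["seed_a"],
}

channels = snowball_sample(
    ["seed_a", "seed_b"],
    lambda channel: LINKS.get(channel, []),
    max_channels=100,
)
print(sorted(channels))
```

The `max_channels` cap keeps the crawl bounded; in a continuous collection the queue would instead be persisted and revisited as channels publish new messages.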
23. Understanding claims & misinformation on the Web: ClaimsKG
Motivation
▪ Claims spread across various (unstructured) fact-checking sites
▪ Claims and truth ratings evolve over time
▪ Finding claims is hard: e.g. claims about / made by US Republican politicians across the Web?
Approach
▪ Continuous harvesting of claims & metadata from fact-checking sites (e.g. snopes.com, Politifact.com etc); currently approx. 75,000 claims since 2019
▪ Feature extraction & linking:
o Mentioned entities
o Joint topic classification
o Normalisation of ratings (true, false, mixture, other); coreference resolution of claims
o Exposing data through established vocabulary and W3C standards (e.g. SPARQL endpoint)
https://data.gesis.org/claimskg/
A. Tchechmedjiev, P. Fafalios, K. Boland, S. Dietze, B. Zapilko, K. Todorov, ClaimsKG – A Live Knowledge Graph of fact-checked Claims, ISWC2019
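Normalisation of ratings means mapping the heterogeneous labels used by individual fact-checking sites onto the four classes named above (true, false, mixture, other). The sketch below shows the mechanism with a small lookup table. The specific label-to-class assignments are illustrative assumptions and not the mapping used by ClaimsKG.

```python
# Illustrative mapping (an assumption): site-specific label -> normalised class.
RATING_MAP = {
    "true": "true",
    "false": "false",
    "pants on fire": "false",
    "half true": "mixture",
    "mixture": "mixture",
}


def normalise_rating(label):
    """Map a site-specific rating label to true / false / mixture / other."""
    # Lowercase, treat hyphens as spaces and collapse repeated whitespace.
    key = " ".join(label.lower().replace("-", " ").split())
    return RATING_MAP.get(key, "other")


for label in ["True", "Pants on Fire", "Half-True", "Unproven"]:
    print(label, "->", normalise_rating(label))
```

Any label missing from the table falls back to "other", so new or unusual ratings do not silently end up in one of the truth classes.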
24. Evolution of claims: frequency & topics
https://data.gesis.org/claimskg/
S. Gangopadhay et al., Investigating Characteristics, Biases and Evolution of Fact-Checked Claims on the Web, ACM Hypertext 2024 (under review)
25. Evolution of claims: topic biases of fact-check sources
https://data.gesis.org/claimskg/
S. Gangopadhay et al., Investigating Characteristics, Biases and Evolution of Fact-Checked Claims on the Web, ACM Hypertext 2024 (under review)
26. 32
Stances towards claims / fake news in social media
Motivation
▪ Problem: detecting stance of documents (e.g. social media posts)
towards a given claim (unbalanced class distribution)
▪ Motivation: stance of documents (in particular disagreement) useful
(a) as signal for truthfulness (fake news detection) and (b) for document
or source classification (e.g. users)
Approach
▪ Cascading binary classifiers: addressing individual issues (e.g.
misclassification costs) per step
▪ Features, e.g. textual similarity (Word2Vec etc), sentiments, LIWC,
etc.
▪ Best-performing models: 1) SVM with class-wise penalty, 2) CNN, 3)
SVM with class-wise penalty
▪ Experiments on FNC-1 dataset (and FNC baselines)
Results
▪ Minor overall performance improvement
▪ Improvement on disagree class by 27%
(but still far from robust)
A., Fafalios, P., Ekbal, A., Zhu, X., Dietze, S., Exploiting stance hierarchies for cost-sensitive stance detection of Web documents, J Intell. Inf. Syst. 58(1), 1-19 (2022)
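The idea of cascading binary classifiers can be sketched as a chain of yes/no decisions, each of which could be trained and tuned (e.g. with its own misclassification costs) separately. The sketch below uses the four FNC-1 stance labels (unrelated, discuss, agree, disagree); the order of the steps and the keyword-based toy classifiers are assumptions for illustration only and stand in for the trained SVM/CNN models.

```python
def cascade_stance(claim, document, is_related, takes_stance, is_disagree):
    """Assign a stance label through three successive binary decisions."""
    if not is_related(claim, document):
        return "unrelated"
    if not takes_stance(claim, document):
        return "discuss"
    return "disagree" if is_disagree(claim, document) else "agree"


# Toy stand-ins for trained binary classifiers (hypothetical rules).
def toy_is_related(claim, document):
    shared = set(claim.lower().split()) & set(document.lower().split())
    return len(shared) >= 2


def toy_takes_stance(claim, document):
    cues = {"confirms", "true", "false", "hoax", "debunked", "denies"}
    return bool(cues & set(document.lower().split()))


def toy_is_disagree(claim, document):
    cues = {"false", "hoax", "debunked", "denies"}
    return bool(cues & set(document.lower().split()))


claim = "vaccine causes illness"
documents = [
    "the weather is nice today",
    "report discusses whether vaccine causes side effects",
    "study confirms vaccine causes illness",
    "experts say vaccine causes illness claim is a hoax",
]
labels = [
    cascade_stance(claim, d, toy_is_related, toy_takes_stance, toy_is_disagree)
    for d in documents
]
print(labels)
```

Splitting the task this way lets the rare disagree class be handled in a dedicated final step instead of competing with the much larger unrelated class in a single multiclass model.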
27. Wrap-up: found data
Archival/collection
▪ Easy (assuming gatekeeper's goodwill), even over long time periods (TweetsKB: 10 years)
▪ Public APIs, screen-scraping, crawling
Analysis
▪ Heterogeneity and scale of data (example Tweets, query logs)
▪ Feature extraction (stances, topics, emotions, etc) across entire corpus challenging
▪ Specific research questions usually require dedicated models (no one-size-fits-all approach)
Sharing
▪ Strict constraints (legal, ethical, licensing)
▪ Scalable sharing of sensitive data still unsolved problem
28. Designed behavioral web data to the rescue
▪ Goal: obtain shareable and easy-to-interpret behavioral web
data through experimental lab studies & quasi-experiments
▪ Typically involves:
− Artificial settings (e.g. labs),
− Simulation of real-world online scenarios
(e.g. web search)
− Usually less sensitive
− Full consent of participants about data collection &
sharing intentions
− Short time intervals
− Small-scale data (due to costly process)
29. Case: web search behavior (SAL = „Search As Learning“)
Research challenges at the intersection of AI/ML,
HCI & cognitive psychology
▪ Detecting coherent search missions?
▪ Detecting learning throughout search, i.e. detecting
“informational” search missions (as opposed to
“transactional” or “navigational” missions)?
▪ How competent is the user? –
Predict/understand knowledge state of users
based on in-session behavior/interactions
▪ How well does a user achieve his/her learning
goal/information need? - Predict knowledge gain
throughout search session
Hoppe, A., Holtz, P., Kammerer, Y., Yu, R., Dietze, S., Ewerth, R., Current Challenges for Studying Search as Learning Processes, 7th Workshop on
Learning & Education with Web Data (LILE2018), in conjunction with ACM Web Science 2018 (WebSci18), Amsterdam, NL, 27 May, 2018.
30. Data collection for understanding knowledge gain/state of users
Gadiraju, U., Yu, R., Dietze, S., Holtz, P., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM CHIIR 2018.
Data collection - summary
▪ Crowdsourced collection of search session data
▪ 10 search topics (e.g. “Altitude sickness”,
“Tornados”), incl. pre- and post-tests to assess
user knowledge
▪ Approx. 1000 distinct crowd workers & 100
sessions per topic
▪ Tracking of user behavior through 76 features
in 5 categories (session, query, SERP – search
engine result page, browsing, mouse traces)
31. Understanding knowledge gain/state of users during web search
Some results
▪ 70% of users exhibited a knowledge gain (KG)
▪ Negative relationship between KG of users and
topic popularity (avg. accuracy of workers in
knowledge tests) (R = -.87)
▪ Amount of time users actively spent on web pages
explains 7% of the variance in their KG
▪ Query complexity explains 25% of the variance in
the KG of users
▪ Topic-dependent behavior: search behavior
correlates more strongly with search topic than with
KG/KS
Gadiraju, U., Yu, R., Dietze, S., Holtz, P., Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. ACM CHIIR 2018.
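The results above are stated as a correlation coefficient and as shares of explained variance. For a single predictor in a simple linear relationship, the share of variance explained equals the squared Pearson correlation, so R = -.87 corresponds to roughly 76% shared variance. The sketch below shows the computation on invented toy numbers; the data are hypothetical, and how the 7% and 25% figures were obtained in the study is not stated here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: per-topic average test accuracy vs. average gain.
topic_accuracy = [0.2, 0.4, 0.6, 0.8]
mean_gain = [0.5, 0.4, 0.25, 0.1]

r = pearson_r(topic_accuracy, mean_gain)
print(f"r = {r:.2f}, variance explained = {r * r:.0%}")
```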
32. Predicting knowledge gain/state during web search
▪ Same session data as Gadiraju et al., 2018
▪ Stratification of users by knowledge state (KS) and knowledge
gain (KG) into classes {low, moderate, high}, using mean ± 0.5 SD
as class boundaries (low < mean − 0.5 SD; high > mean + 0.5 SD)
▪ Supervised multiclass classification
(Naive Bayes, Logistic regression, SVM, random forest, multilayer perceptron)
▪ KG prediction performance results (after 10-fold cross-validation)
▪ Considers in-session features (behavioural traces) only
Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Predicting User Knowledge Gain in Informational Search Sessions. ACM SIGIR 2018.
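The stratification into {low, moderate, high} around mean ± 0.5 SD can be sketched as follows. This is a minimal illustration, assuming that scores are plain numbers (e.g. post-test minus pre-test results) and that the population standard deviation is used; neither detail is specified in the slides, and the example scores are invented.

```python
import statistics


def stratify(scores, width=0.5):
    """Label each score low / moderate / high relative to mean ± width * SD."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # assumption: population SD
    lower = mean - width * sd
    upper = mean + width * sd
    labels = []
    for score in scores:
        if score < lower:
            labels.append("low")
        elif score > upper:
            labels.append("high")
        else:
            labels.append("moderate")
    return labels


# Hypothetical knowledge gain scores for seven users.
gains = [1, 2, 5, 5, 5, 8, 9]
classes = stratify(gains)
print(list(zip(gains, classes)))
```

The resulting class labels serve as targets for the supervised multiclass classifiers, with in-session behavioural features as inputs.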
33. Predicting knowledge gain/state during SAL: Features
Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Predicting User Knowledge Gain in Informational Search Sessions. ACM SIGIR 2018.
Behavioral
features
34. Predicting knowledge gain/state during web search
▪ Feature importance (knowledge gain prediction task)
Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Predicting User Knowledge Gain in Informational Search Sessions. ACM SIGIR 2018.
35. Predicting knowledge gain/state during web search
▪ Feature importance (knowledge state prediction task)
Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S., Predicting User Knowledge Gain in Informational Search Sessions. ACM SIGIR 2018.
36. Gaze data as additional source of behavioral data in SAL
Davari, M., Yu, R., Dietze, S., Understanding the Influence of Topic Familiarity on Search
Behavior in Digital Libraries, EARS 2019 – International Workshop on ExplainAble
Recommendation and Search, @ SIGIR2019, 2019.
Otto, C., Yu, R., Pardi, G., von Hoyer, J., Rokicki, M., Hoppe, A., Holtz, P., Kammerer, Y.,
Dietze, S., Ewerth, R., Predicting Knowledge Gain during Web Search based on Multimedia
Resource Consumption, 22nd International Conference on Artificial Intelligence in Education
(AIED2021), 2021
▪ Eye gaze data (word-, sentence-, or HTML structure-
level) as additional source of behavioral data
▪ Various studies in SAL context and beyond to
understand topic familiarity, knowledge &
competence or comprehension issues
▪ Usually small study sizes (e.g. 25 < N < 150)
▪ Costly but highly informative features
37. Facilitating SAL research through public research data
https://data.uni-hannover.de/dataset/sal-dataset
Otto, C., Rokicki, M., Pardi, G., Gritz, W., Hienert, D., Yu, R., Hoyer, J., Hoppe, A., Dietze, S., Holtz, P., Kammerer, Y., Ewerth, R., SaL-Lightning Dataset: Search and Eye
Gaze Behavior, Resource Interactions and Knowledge Gain during Web Search, ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR2022).
38. Case: crowd worker behavior in microtask crowdsourcing
Gadiraju, U., Kawase, R., Dietze, S., Demartini, G., Understanding Malicious Behavior in Crowdsourcing Platforms: The Case of Online Surveys. ACM CHI2015
Gadiraju, U., Demartini, G., Kawase, R., Dietze, S., Crowd Anatomy Beyond the Good and Bad: Behavioral Traces for Crowd Worker Modeling and Pre-
selection, Computer Supported Cooperative Work 28(5): 815-841 (2019), Springer, 2019.
„Fast Deceiver“
„Competent Worker“
▪ Context: online crowdsourcing tasks widely used to
collect data
▪ Research question: can we classify different worker
types (and detect competent workers) from behavioral
traces alone (mouse movements, scrolling, keystrokes
etc.)?
▪ Various studies in experimental conditions capturing
wide range of features in various tasks
▪ Low-level behavioural features highly informative when
predicting worker competence and output quality
39. Wrap-up: found vs designed behavioral data
FOUND DATA
▪ Collection: as long as gatekeepers allow crawling / scraping
▪ Analysis: large & heterogeneous data; long time intervals; no „one-size-fits-all“ methods
▪ Sharing: sensitive information; ethical, legal, licensing constraints
DESIGNED DATA
▪ Collection: costly experimental data collection
▪ Analysis: homogeneous, small-scale data; short time intervals; limited use cases
▪ Sharing: full consent of participants; little sensitive information due to artificial tasks
40. Key take-aways
▪ Behavioral Web Data: crucial ingredient for wide range of research across various disciplines
▪ Found Data: crucial to archive to ensure long-term access; sharing is hard due to sensitive
information.
▪ Designed Data: collection is costly; limited scale and scope of data.
▪ Access to behavioral web data remains a challenge => ongoing & future work @ KTS/GESIS on
− infrastructures for collecting experimental data (e.g. in web search)
− infrastructures for data access (e.g. for tweet archives)
− non-sensitive data offers to enable reuse of sensitive found data (e.g. TweetsKB)