How can media and discourse analyses combine approaches from the humanities with statistical methods to analyse large amounts of online content in depth?
Invited talk at the Fachgruppen-Workshop (section workshop) of the Deutsche Gesellschaft für Publizistik und Kommunikationswissenschaft
"Soziale Medien – Echo-Kammer oder öffentlicher Raum? Ansätze zur computergestützten Analyse von Internet-Korpora"
(Social Media: Echo Chamber or Public Space? Approaches to the Computer-Assisted Analysis of Internet Corpora)
6 October 2016, Karlsruhe Institute of Technology (KIT)
(Semi-)Automatic Analysis of Online Content
Steffen Staab
@ststaab
Web and Internet Science Group · ECS · University of Southampton, UK &
Institute for Web Science and Technologies · University of Koblenz-Landau, Germany
3. (Semi-)Automatic analysis of online content 3/68Steffen Staab
Is it difficult?
„Nach dem Auspacken der LPS-105 präsentiert sich dem
Betrachter ein stabiles Laufwerk, das genauso geringe
Außenmaße besitzt wie die Maxtor.“
Unpacking the LPS 105 reveals a sturdy disk drive which is of
the same small size as the Maxtor.
„Content“ analysis: What is in online content?
....
Entailment
Summaries
Arguments
Discourse
Opinions
Sentiments
Facts – who, what, when?
Syntax
Semantics
Pragmatics
Knowledge
Purpose
Technical objectives
• Search
• Data & knowledge bases:
- facts
- arguments
- ...
Applications
• Google Search
• Watson
• „Watson 2“
Social science and humanities objectives
• Form hypotheses
• Find indications
• Recognize trends
• ...
CREAM – Creating Metadata (Handschuh et al. 2002, 2003)
Document Viewer / Editor
Ontology Guidance & Fact Browser
Concepts
Instances of Concepts
Attribute Instances = instance of a property to a datatype instance
Relationship Instances = instance of a property to a class instance
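The difference between attribute instances and relationship instances can be illustrated with a minimal sketch. The class names (`Instance`, `AttributeInstance`, `RelationshipInstance`) and the property names are hypothetical and chosen for illustration only; they are not CREAM's actual data model or API.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Instance:
    """An instance of an ontology concept (class)."""
    identifier: str
    concept: str


@dataclass(frozen=True)
class AttributeInstance:
    """A property linking an instance to a datatype value (string, number)."""
    subject: Instance
    prop: str
    value: Union[str, int, float]


@dataclass(frozen=True)
class RelationshipInstance:
    """A property linking an instance to another class instance."""
    subject: Instance
    prop: str
    target: Instance


# Hypothetical annotation of facts found on the title slide.
staab = Instance("Steffen Staab", "Person")
uni = Instance("University of Koblenz-Landau", "University")

handle = AttributeInstance(staab, "twitterHandle", "@ststaab")
affiliation = RelationshipInstance(staab, "affiliatedWith", uni)
```

The attribute instance ends in a plain value, while the relationship instance ends in another annotated instance that can itself carry further attributes and relationships.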
CREAM – Creating Metadata (Handschuh et al 2002, 2003)
Open world: target ontologies could now be:
• Schema.org (3 trillion facts collected by Google; 10,000s of concepts)
• Wikidata (1,148,230 concepts as of two weeks before the talk)
Annotating facts with CREAM
+++
Open (with respect to ontologies)
Flexible
Semi-automatic: S-CREAM
---
Effort for annotation (minimize the number of clicks)
Thick client
Technology Readiness Level ~5
A lot of effort to prepare the tool for a task
Limited accuracy
Technology Readiness Levels
TRL 1: Observation and description of the basic principle (8-15 years to market maturity)
TRL 2: Description of the application of a technology
TRL 3: Proof that a technology works (5-13 years to market maturity)
TRL 4: Experimental setup in the laboratory
TRL 5: Experimental setup in the operational environment
TRL 6: Prototype in the operational environment
TRL 7: Prototype in operation (1-5 years)
TRL 8: Qualified system with proof that it works in the operational environment
TRL 9: Qualified system with proof of successful operation
CLUSTERING OF TEXT DATA
http://topicmodels.west.uni-koblenz.de
With Christoph Kling
Text Mining Documents
Documents are PDFs, emails, tweets, Flickr photo tags, Word companions, …
Documents consist of
• a bag of words
• metadata
- author(s)
- timestamp
- geolocation
- publisher
- booktitle
- device
- ...
Example topics:
• Chinese food: dimsum, duck, eggs, ...
• Vegan food: vegan, tofu, ...
• Breakfast: eggs, ham, ...
Objective:
Cluster, categorize,
& explain
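A document in this sense can be sketched as a bag of words plus a metadata record. This is a minimal illustration only: the `Document` class, the `make_document` helper, and the example author and timestamp are hypothetical, and the whitespace tokenization is deliberately naive.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    bag: Counter                                   # word -> count, word order is discarded
    metadata: dict = field(default_factory=dict)   # e.g. author, timestamp, geolocation


def make_document(text, **metadata):
    # Naive whitespace tokenization; a real pipeline would also strip
    # punctuation and remove stop words.
    tokens = text.lower().split()
    return Document(bag=Counter(tokens), metadata=metadata)


# Hypothetical example document.
doc = make_document("Eggs ham eggs", author="alice", timestamp="2016-10-06T08:30")
```

Clustering then operates on the word counts, while the metadata stays available for explaining or improving the clusters.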
Latent Dirichlet Allocation (LDA)
Document-topic distributions
Topic-word distributions
K topics
M documents
Each document m of the M documents has length N_m
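The generative view behind LDA can be sketched as follows. K, M, and the document lengths N_m follow the slide; the vocabulary size `V`, the symmetric Dirichlet hyperparameters `alpha` and `beta`, and the function name `generate_corpus` are assumptions made for this illustration. The sketch only samples a synthetic corpus and does not perform inference.

```python
import numpy as np


def generate_corpus(M, K, V, doc_lengths, alpha=0.5, beta=0.5, seed=0):
    """Sample a synthetic corpus from the LDA generative process."""
    rng = np.random.default_rng(seed)
    # Topic-word distributions: one distribution over V words per topic.
    phi = rng.dirichlet(np.full(V, beta), size=K)      # shape (K, V)
    # Document-topic distributions: one distribution over K topics per document.
    theta = rng.dirichlet(np.full(K, alpha), size=M)   # shape (M, K)
    docs = []
    for m in range(M):
        # Draw a topic for each of the N_m token positions in document m.
        z = rng.choice(K, size=doc_lengths[m], p=theta[m])
        # Draw each word from the distribution of its assigned topic.
        words = np.array([rng.choice(V, p=phi[k]) for k in z], dtype=int)
        docs.append(words)
    return theta, phi, docs


theta, phi, docs = generate_corpus(M=3, K=2, V=5, doc_lengths=[4, 6, 2])
```

Fitting LDA reverses this process: given only the observed words, it estimates the document-topic and topic-word distributions.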
Use Metadata to Help Topic Prediction
Improve topic detection
→ Morning timestamps may help to detect the breakfast topic
Describe dependencies: metadata ↔ topics
→ The breakfast topic occurs during morning hours
Usage
Autocompletion
→ From words to words
Prediction of search queries
→ From metadata to words
→ From words to metadata
Example topics:
• Chinese food: dimsum, duck, eggs, ...
• Vegan food: vegan, tofu, ...
• Breakfast: eggs, ham, ...
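The direction "from metadata to words" can be sketched by marginalising over topics: p(word | metadata) = Σ_k p(word | topic k) · p(topic k | metadata). All probabilities below are made-up illustrative numbers, and the names `phi`, `topic_given_time`, `words_given_metadata`, and `top_word` are hypothetical. This is a simplified stand-in, not the model presented in the talk.

```python
import numpy as np

vocab = ["dimsum", "duck", "eggs", "vegan", "tofu", "ham"]

# Hypothetical topic-word distributions p(word | topic); each row sums to 1.
phi = np.array([
    [0.5, 0.3, 0.2, 0.0, 0.0, 0.0],   # Chinese food
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.0],   # Vegan food
    [0.0, 0.0, 0.5, 0.0, 0.0, 0.5],   # Breakfast
])

# Hypothetical p(topic | time of day): breakfast dominates in the morning.
topic_given_time = {
    "morning": np.array([0.1, 0.2, 0.7]),
    "evening": np.array([0.6, 0.3, 0.1]),
}


def words_given_metadata(time_of_day):
    """Return p(word | time of day) by summing over all topics."""
    p = topic_given_time[time_of_day] @ phi
    return dict(zip(vocab, p))


def top_word(time_of_day):
    """Return the most probable word for the given time of day."""
    probs = words_given_metadata(time_of_day)
    return max(probs, key=probs.get)
```

With these numbers, `top_word("morning")` yields `"eggs"` and `top_word("evening")` yields `"dimsum"`, which is the kind of signal a query-prediction feature could use.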
Dataset
Linux Kernel Mailing List
3,400,000 emails with timestamps and mailing-list ID
Structures of Metadata Spaces
• Nominal
• Ordinal
• Cyclic
• Spherical
• Networked
The spatial model is not used in this application (but might be)!
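Why the structure of a metadata space matters can be illustrated with a cyclic variable such as the hour of day, assumed here as an example: 23:00 and 01:00 are two hours apart on a cycle, but 22 apart if the hours are treated as ordinal. The function names below are hypothetical and not taken from the talk.

```python
import math


def cyclic_distance(a, b, period=24.0):
    """Shortest distance between two points on a cycle of the given period."""
    d = abs(a - b) % period
    return min(d, period - d)


def to_unit_circle(value, period=24.0):
    """Embed a cyclic value as a point on the unit circle."""
    angle = 2 * math.pi * (value % period) / period
    return (math.cos(angle), math.sin(angle))


late_vs_early = cyclic_distance(23, 1)   # 2.0 on the cycle, not 22
x, y = to_unit_circle(6)                 # a quarter turn around the circle
```

Embedding cyclic metadata on a circle keeps neighbouring values (late night, early morning) close together, which an ordinal encoding would not.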
Influence of Sociodemographics on Favorite Fetishes
Other applications of (extended) LDA
• Sentiment and topics (Naveed et al., ICWSM 2013)
• Topics and spatial knowledge (Kling et al., WSDM 2014)
• Modelling of power (Kling et al., ICWSM 2015)
BELIEVABILITY AND TRUST IN ONLINE NEWS
With Christoph Kling, Jérôme Kunegis
Collaboration with Jutta Milde, Karin Stengel, Ines Vogel
Ongoing work in KOMEPOL
Lessons Learned
New targets
• Require new modeling of gaps
Challenges
• Technology Readiness Levels
• Many tools, but no „good“ tool („done is better than perfect“?)
• Reproducibility
ToDos
• An Eclipse/Protégé of annotation:
- modular
- extensible
- open
• Optimizing the processes
No tool to rule them all
....
Entailment
Summaries
Arguments
Discourse
Opinions
Sentiments
Facts – who, when, where, what?
Syntax
Semantics
Pragmatics
Knowledge
Bibliography
C. C. Kling, J. Kunegis, S. Sizov, and S. Staab. “Detecting non-gaussian geographical topics in tagged photo collections.” In: Seventh ACM International Conference on Web Search and Data Mining, WSDM 2014, New York, NY, USA, February 24-28, 2014.
I. C. Vogel, J. Milde, K. Stengel, S. Staab, C. C. Kling, and J. Kunegis. “Glaubwürdigkeit und Vertrauen von Online-News.” In: Datenschutz und Datensicherheit 39.5 (2015), pp. 312–316.
S. Handschuh and S. Staab. “CREAM – CREAting Metadata for the Semantic Web.” In: Computer Networks 42.5 (2003), pp. 579–598. Elsevier.
S. Handschuh, S. Staab, and F. Ciravegna. “S-CREAM – Semi-automatic CREAtion of Metadata.” In: Proc. of the European Conference on Knowledge Acquisition and Management, EKAW 2002, Madrid, Spain, October 1-4, 2002. LNCS/LNAI 2473, Springer, 2002, pp. 358–372.
C. C. Kling. “Probabilistic Models for Context in Social Media: Novel Approaches and Inference Schemes.” Submitted as PhD thesis, Institute for Web Science and Technologies, University of Koblenz-Landau, to be defended Nov/Dec 2016.
N. Naveed, T. Gottron, and S. Staab. “Feature Sentiment Diversification of User Generated Reviews: The FREuD Approach.” In: ICWSM 2013.
C. C. Kling, J. Kunegis, H. Hartmann, M. Strohmaier, and S. Staab. “Voting Behaviour and Power in Online Democracy: A Study of LiquidFeedback in Germany's Pirate Party.” In: ICWSM 2015, pp. 208–217.