(Semi-)Automatic analysis of online contents

Steffen Staab
@ststaab
Web and Internet Science Group · ECS · University of Southampton, UK &
Institute for Web Science and Technologies · University of Koblenz-Landau, Germany

Content analysis

Is it difficult?
„Nach dem Auspacken der LPS-105 präsentiert sich dem
Betrachter ein stabiles Laufwerk, das genauso geringe
Außenmaße besitzt wie die Maxtor.“
Unpacking the LPS 105 reveals a sturdy disk drive which is of
the same small size as the Maxtor.

„Content“ analysis: What is in online content?
....
Entailment
Summaries
Arguments
Discourse
Opinions
Sentiments
Facts – who, what, when?
Syntax
Semantics
Pragmatics
Knowledge

Purpose
Technical objectives
• Search
• Data & knowledge bases:
  • facts
  • arguments
  • ...
Applications
• Google Search
• Watson
• „Watson 2“
Social science and humanities objectives
• Form hypotheses
• Find indications
• Recognize trends
• ...

Objective-oriented content analysis
....
Entailment
Summaries
Arguments
Discourse
Opinions
Sentiments
Facts – who, what, when?
Syntax
Semantics
Pragmatics
Knowledge

SEMANTIC WEB ANNOTATION

CREAM – Creating Metadata (Handschuh et al 2002, 2003)
Document Viewer / Editor
Ontology Guidance & Fact Browser
• Concepts
• Instances of concepts
• Attribute instances = instance of a property to a datatype instance
• Relationship instances = instance of a property to a class instance
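
The distinction between attribute instances and relationship instances can be sketched as a small data model. This is only an illustration, not the CREAM implementation: the class names `Instance` and `PropertyInstance`, the concept names `Person` and `University`, and the property names `name` and `affiliatedWith` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Instance:
    """An instance of an ontology concept (hypothetical model)."""
    identifier: str
    concept: str


@dataclass(frozen=True)
class PropertyInstance:
    """One annotated statement: subject --prop--> value."""
    subject: Instance
    prop: str
    value: Union[Instance, str, int, float]

    def kind(self) -> str:
        # A property pointing at another concept instance is a relationship
        # instance; a property pointing at a datatype value is an attribute
        # instance.
        return "relationship" if isinstance(self.value, Instance) else "attribute"


# Illustrative annotations; concept and property names are assumptions.
person = Instance("steffen_staab", "Person")
university = Instance("uni_koblenz_landau", "University")

name_fact = PropertyInstance(person, "name", "Steffen Staab")
affiliation = PropertyInstance(person, "affiliatedWith", university)

for fact in (name_fact, affiliation):
    print(fact.prop, "is a(n)", fact.kind(), "instance")
```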

CREAM – Creating Metadata (Handschuh et al 2002, 2003)
Open world: target ontologies could now be:
• Schema.org (3 trillion facts collected by Google; 10,000 of concepts)
• Wikidata (1,148,230 concepts as of 2 weeks ago)

Annotating facts with CREAM
+++
• Open (wrt ontologies)
• Flexible
• Semi-automatic: S-CREAM
---
• Effort for annotation (minimize # of clicks)
• Thick client
• Technology Readiness Level ~5
• A lot of effort to prepare the tool for a task
• Limited accuracy

Technology Readiness Levels
TRL 1: Observation and description of the basic principle (8-15 years to market maturity)
TRL 2: Description of the application of a technology
TRL 3: Proof of functionality of a technology (5-13 years to market maturity)
TRL 4: Experimental setup in the laboratory
TRL 5: Experimental setup in the operational environment
TRL 6: Prototype in the operational environment
TRL 7: Prototype in operation (1-5 years)
TRL 8: Qualified system with proof of functionality in the operational area
TRL 9: Qualified system with proof of successful operation

CLUSTERING OF TEXT DATA
http://topicmodels.west.uni-koblenz.de
With Christoph Kling

Text Mining Documents
Documents are
• PDFs, emails, tweets, Flickr photo tags, Word companions, …
Documents consist of
• bag of words
• metadata
  - author(s)
  - timestamp
  - geolocation
  - publisher
  - booktitle
  - device
  - ...
Example topics (figure):
• Chinese food: dimsum, duck, eggs, ...
• Vegan food: vegan, tofu, ...
• Breakfast: eggs, ham, ...
Objective: Cluster, categorize, & explain

Latent Dirichlet Allocation (LDA)
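• Document-topic distributions: one distribution over the K topics per document
• Topic-word distributions: one distribution over the vocabulary per topic
• K topics
• M documents
• Each document m of the M documents has length N_m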
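
The generative process behind standard LDA can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the model variant used in the talk: the function name `generate_corpus`, the vocabulary size `V`, and the symmetric Dirichlet hyperparameters `alpha` and `beta` are introduced here for illustration only.

```python
import numpy as np


def generate_corpus(K, M, V, doc_lengths, alpha=0.5, beta=0.5, seed=0):
    """Sample a toy corpus from the standard LDA generative process.

    K: number of topics, M: number of documents, V: vocabulary size,
    doc_lengths[m]: length N_m of document m.
    alpha and beta are assumed symmetric Dirichlet hyperparameters.
    """
    rng = np.random.default_rng(seed)

    # Topic-word distributions: one distribution over V words per topic.
    phi = rng.dirichlet(np.full(V, beta), size=K)      # shape (K, V)

    # Document-topic distributions: one distribution over K topics per doc.
    theta = rng.dirichlet(np.full(K, alpha), size=M)   # shape (M, K)

    docs = []
    for m in range(M):
        # Draw a topic for every token position in document m ...
        z = rng.choice(K, size=doc_lengths[m], p=theta[m])
        # ... then draw each word from the chosen topic's word distribution.
        words = np.array([rng.choice(V, p=phi[k]) for k in z], dtype=int)
        docs.append(words)
    return phi, theta, docs


phi, theta, docs = generate_corpus(K=3, M=4, V=10, doc_lengths=[5, 6, 7, 8])
print(phi.shape, theta.shape, [len(d) for d in docs])
```

Inference runs this process in reverse: given only the words, it estimates the document-topic and topic-word distributions.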

Use Metadata to Help Topic Prediction
• Improve topic detection
  → Morning times may help to improve the breakfast topic
• Describe dependencies: metadata ↔ topics
  → breakfast topic happens during morning hours
• Usage
  • Autocompletion
    → From words to words
  • Prediction of search queries
    → From metadata to words
    → From words to metadata
Example topics (figure):
• Chinese food: dimsum, duck, eggs, ...
• Vegan food: vegan, tofu, ...
• Breakfast: eggs, ham, ...
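
Going "from metadata to words" and "from words to metadata" can be illustrated by chaining a topic-given-metadata distribution with the topic-word distributions. The sketch below is a toy illustration: all probabilities are made-up numbers, and the names `phi`, `topic_given_time`, `words_given_time`, and `time_given_word` are assumptions for this example, not part of the model presented in the talk.

```python
import numpy as np

topics = ["Chinese food", "Vegan food", "Breakfast"]
vocab = ["dimsum", "duck", "eggs", "vegan", "tofu", "ham"]

# Hypothetical topic-word distributions P(word | topic); each row sums to 1.
phi = np.array([
    [0.4, 0.4, 0.2, 0.0, 0.0, 0.0],   # Chinese food
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.0],   # Vegan food
    [0.0, 0.0, 0.5, 0.0, 0.0, 0.5],   # Breakfast
])

# Hypothetical P(topic | time of day): breakfast dominates in the morning.
topic_given_time = {
    "morning": np.array([0.1, 0.2, 0.7]),
    "evening": np.array([0.6, 0.3, 0.1]),
}


def words_given_time(time_of_day):
    """From metadata to words: P(w | t) = sum_k P(k | t) * P(w | k)."""
    return topic_given_time[time_of_day] @ phi


def time_given_word(word):
    """From words to metadata via Bayes' rule, assuming a uniform prior over times."""
    w = vocab.index(word)
    times = list(topic_given_time)
    likelihood = np.array([words_given_time(t)[w] for t in times])
    posterior = likelihood / likelihood.sum()
    return dict(zip(times, posterior))


morning = words_given_time("morning")
print("Most likely morning word:", vocab[int(np.argmax(morning))])
print("P(time | 'ham'):", time_given_word("ham"))
```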

Dataset
• Linux Kernel Mailing List: 3,400,000 emails with timestamps and mailing list ID

Structures of Metadata Spaces
• Nominal
• Ordinal
• Cyclic
• Spherical
• Networked
[Figure labels: Kern, Desk, Mail]
The spatial model is not used in this application (but might be)!
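
Why the structure of a metadata space matters can be seen with cyclic metadata such as the hour of day: treated as ordinal, 23:00 and 01:00 look far apart, while on the cycle they are neighbours. The helper `cyclic_distance` below is a hypothetical illustration and not taken from the talk.

```python
def cyclic_distance(a, b, period=24.0):
    """Shortest distance between two points on a cycle of the given period."""
    d = abs(a - b) % period
    return min(d, period - d)


# Ordinal view: 23:00 and 01:00 are 22 hours apart.
print(abs(23 - 1))
# Cyclic view: they are only 2 hours apart.
print(cyclic_distance(23, 1))
```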

Topics
• Professional topics:
• Hobbyist topics:

Topics
• Metadata weighting:

126,408 Online Fetish Users: First 8 Topics

Sociodemographics of Fetish dataset

Influence of Sociodemographics on Favorite Fetishes

Other applications of (extended) LDA
• Sentiment and topics (Naveed et al., ICWSM 2013)
• Topics and spatial knowledge (Kling et al., WSDM 2014)
• Modelling of power (Kling et al., ICWSM 2015)

BELIEVABILITY AND TRUST IN ONLINE NEWS
With Christoph Kling, Jérôme Kunegis
Collaboration with Jutta Milde, Karin Stengel, Ines Vogel
Ongoing work in KOMEPOL

Targets

Example article at Spiegel.de

Requirements
Scalability:
• # Documents
• # Annotators
• # Annotations per annotator
Tool:
• Administration
• Crowdsourcing
• Semi-automatic
Separating article management and coding

Text-Upload

Managing projects

Article

Defining a Coding-Job

Highlighting using Keywords and Clustering

Article coding

Preparing a code book (1)

Preparing a code book (2)

CONCLUSION

Lessons Learned
New targets
• Require new modeling of gaps
Challenges
• Technology Readiness Levels
• Many tools – no „good“ tool („done is better than perfect“?)
• Reproducibility
ToDos
• Eclipse/Protégé of annotation
  • modular
  • extensible
  • open
• Optimizing the processes

No tool to rule them all
....
Entailment
Summaries
Arguments
Discourse
Opinions
Sentiments
Facts – who, when, where, what?
Syntax
Semantics
Pragmatics
Knowledge

THANK YOU FOR YOUR
ATTENTION!

Bibliography
C. C. Kling, J. Kunegis, S. Sizov, and S. Staab. “Detecting non-gaussian geographical topics in tagged photo collections.” In: Seventh ACM International Conference on Web Search and Data Mining, WSDM 2014, New York, NY, USA, February 24-28, 2014.
I. C. Vogel, J. Milde, K. Stengel, S. Staab, C. C. Kling, and J. Kunegis. “Glaubwürdigkeit und Vertrauen von Online-News.” In: Datenschutz und Datensicherheit 39.5 (2015), pp. 312–316.
S. Handschuh, S. Staab. CREAM – CREAting Metadata for the Semantic Web. Computer Networks 42(5): 579-598, Elsevier, 2003.
S. Handschuh, S. Staab, F. Ciravegna. S-CREAM – Semi-automatic CREAtion of Metadata. In: Proc. of the European Conference on Knowledge Acquisition and Management – EKAW-2002. Madrid, Spain, October 1-4, 2002. LNCS/LNAI 2473, Springer, 2002, pp. 358-372.
C. Kling. Probabilistic Models for Context in Social Media. Novel Approaches and Inference Schemes. Submitted as PhD thesis, Institute for Web Science and Technologies, University of Koblenz-Landau, to be defended Nov/Dec 2016.
N. Naveed, T. Gottron, S. Staab. Feature Sentiment Diversification of User Generated Reviews: The FREuD Approach. ICWSM 2013.
C. C. Kling, J. Kunegis, H. Hartmann, M. Strohmaier, S. Staab. Voting Behaviour and Power in Online Democracy: A Study of LiquidFeedback in Germany's Pirate Party. ICWSM 2015: 208-217.

URLs
http://topicmodels.west.uni-koblenz.de
http://komepol.west.uni-koblenz.de
http://www.slideshare.net/steffenstaab
