Towards Topics-based, Semantics-assisted News Search | WIMS13

Sächsische AufbauBank
Research and Development - Project Funding
Project number - 99457/2677
Martin Voigt, Michael Aleythe, Peter Wehner

Structure
Motivation, Problems, and Goals
Topic/S Workflow
Demo
Current and Upcoming Tasks
Conclusion
Friday, 14.06.2013 Topic/S

Motivation
fink & PARTNER Media Services GmbH
Media management for publishing houses
Some customers
Chair of Multimedia Technology, TU Dresden
Research fields
Adaptive, composite Rich Internet Applications
Semantic document life cycle management

Motivation
Newsroom
Source: ringier.com

Problem
Overwhelming amount of data
e.g., Mainpost 2000 articles/day from agencies
and in-house production
News agencies: DPA, Reuters, KNA
Web, social media: Twitter, Facebook, Blogs, …
…
In-house production
Archive
Online

Problem
Hard to identify topics
Browsing
Keyword identification
And their relations, media, and trends
Source: Zeit.de

Vision
Automatic topic discovery using named entities and other keywords (Semantic Items, SemItems)
Investigation of trending topics
Push them to the editor
[Figure: pre-processing links media assets (MA1, MA2) to named entities (E1 to E7); post-processing groups the named entities into topics (T1 to T3).]

Requirements
Extraction and disambiguation of (German) SemItems
Model and storage of semantic information
Topic and trend discovery
Scalable architecture for business use case

Structure
Topic/S Workflow
– Overview
– Pre-Processing
– Semantic Model, Facts, and Storage
– Post-Processing
– Search and User Interface
Demo
Conclusion

Workflow

Workflow: Preprocessor
Language Recognition
Based on article content
Supports German and English
Rule-based solution:
– Share of capitalized words (en 18% vs. de 43%)
– Occurrence of umlauts (ä, ö, ü)
– Presence of language-specific words
• en: of, to, and, a, for, the, that
• de: der, das, und, sich, auf
Precision: 99%
Source: onelanguageoneposter.com
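
The three rules on this slide can be sketched as a small scoring function. This is only an illustration under stated assumptions, not the Topic/S implementation: the name `detect_language`, the scoring weights, and the 30.5% capitalization threshold (the midpoint between the 18% and 43% figures) are hypothetical.

```python
import re

# Language-specific words taken from the slide.
EN_WORDS = {"of", "to", "and", "a", "for", "the", "that"}
DE_WORDS = {"der", "das", "und", "sich", "auf"}

# Assumed threshold: midpoint between the reported shares (en 18%, de 43%).
CAPITAL_SHARE_THRESHOLD = 0.305


def detect_language(text: str) -> str:
    """Return 'de', 'en', or 'unknown' using three simple rules."""
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of letters, Unicode-aware
    if not tokens:
        return "unknown"

    lowered = [t.lower() for t in tokens]
    en_score = sum(t in EN_WORDS for t in lowered)
    de_score = sum(t in DE_WORDS for t in lowered)

    # Rule: umlauts point to German (weight of 2 is an assumption).
    if re.search(r"[äöüÄÖÜ]", text):
        de_score += 2

    # Rule: German capitalizes nouns, so its share of capitalized words is higher.
    capital_share = sum(t[0].isupper() for t in tokens) / len(tokens)
    if capital_share > CAPITAL_SHARE_THRESHOLD:
        de_score += 1
    else:
        en_score += 1

    if de_score == en_score:
        return "unknown"
    return "de" if de_score > en_score else "en"


print(detect_language("Der Bürgermeister und die Gemeinde streiten sich auf dem Markt"))
print(detect_language("The mayor said that the cost of the project is a problem"))
```

A real detector would likely tune the weights on labelled articles; the sketch only shows how the three signals could be combined.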

Workflow: Preprocessor
Keywords
Lemmatization
Developing a word list
Extraction using the word list
Bonus: frequent terms of an article
Source: hugdaily.org
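
A minimal sketch of the keyword step, assuming lemmatization can be approximated by a lookup table. The tables `LEMMAS` and `WORD_LIST` and the function `extract_keywords` are hypothetical toy stand-ins; the slides do not describe the lemmatizer or the contents of the word list.

```python
import re
from collections import Counter

# Hypothetical toy lemma table (inflected form -> lemma).
LEMMAS = {"soldaten": "soldat", "grenzen": "grenze", "proteste": "protest"}

# Hypothetical excerpt of a keyword word list (lemmas only).
WORD_LIST = {"soldat", "grenze", "protest", "waffenstillstand"}


def extract_keywords(text: str) -> Counter:
    """Lemmatize the tokens and count those that appear in the word list."""
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    lemmas = [LEMMAS.get(t, t) for t in tokens]
    return Counter(lemma for lemma in lemmas if lemma in WORD_LIST)


keywords = extract_keywords("Soldaten sichern die Grenze. Proteste an den Grenzen.")
print(keywords.most_common())
```

The counter also yields the "bonus" mentioned on the slide: the most frequent terms of an article fall out of `most_common()`.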

Categorisation
Classification of text
One categoriser per news agency
IPTC categories
Categories are useful for identifying topics

Categorisation - Training
[Figure: an article and its IPTC Media Topic (e.g., Politics) are used to train the categoriser.]

Categorisation - Training
[Figure: articles from OTS, Reuters, and DPA, each with its IPTC Media Topic (e.g., Politics), train a separate categoriser per agency.]

Categorisation
[Figure: a DPA article is passed to the DPA categoriser (one of the OTS, DPA, and Reuters categorisers), which assigns an IPTC Media Topic (e.g., Politics).]
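
The per-agency setup can be sketched as one classifier instance per news agency, selected by the article's source. The slides do not name the classification method, so the multinomial naive Bayes below is a stand-in chosen for brevity; the class `NaiveBayesCategoriser`, the helper names, and the training snippets are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


class NaiveBayesCategoriser:
    """Tiny multinomial naive Bayes with add-one smoothing (stand-in classifier)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> token counts
        self.doc_counts = Counter()              # category -> number of training articles
        self.vocabulary = set()

    def train(self, text, category):
        tokens = tokenize(text)
        self.word_counts[category].update(tokens)
        self.doc_counts[category] += 1
        self.vocabulary.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# One categoriser per news agency, created on first use.
categorisers = defaultdict(NaiveBayesCategoriser)


def train(agency, text, iptc_topic):
    categorisers[agency].train(text, iptc_topic)


def classify(agency, text):
    return categorisers[agency].classify(text)


# Hypothetical training snippets.
train("DPA", "Kanzlerin Regierung Bundestag", "politics")
train("DPA", "Tor Spiel Bundesliga", "sport")
train("Reuters", "government parliament election", "politics")
train("Reuters", "match goal league", "sport")

print(classify("DPA", "Regierung im Bundestag"))
```

Keeping the models separate lets each one adapt to the vocabulary and style of its agency, which is the design choice the slides motivate.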

Categorisation - Quality
News agency   Accuracy
KNA           80.3 %
DPA           94.4 %
EPD           80.3 %
Reuters       90.8 %
OTS           93.5 %
AFP           86 %

Method                             Accuracy
One categoriser for all agencies   85 %
One categoriser per agency         87.5 %

Named Entity Recognition
Recognition of persons, organizations, places
Two methods: word list, statistics
Additional information:
– occurrence count
– part of the text in which the NE appeared
Source: churchthought.com

Named Entity Recognition – Approaches
Word list
Tool: LingPipe + Extension
Sources: LOD (DBpedia, Geonames, YAGO2)
Advantages: controlled vocabulary, guaranteed recognition of entities
Statistics
Tool: Stanford NLP
Source: pre-trained model
Advantage: recognition of unknown entities
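
The word-list approach, together with the additional information recorded per entity (occurrence count and text part), can be illustrated with a plain dictionary matcher. This sketch does not use LingPipe or Stanford NLP; `WORD_LIST`, `recognise`, and the sample article are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical mini word list: entity name -> entity type.
WORD_LIST = {
    "Angela Merkel": "person",
    "Berlin": "place",
    "AfD": "organization",
}

# Longest names first so multi-word names win over shorter alternatives.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(name) for name in sorted(WORD_LIST, key=len, reverse=True))
    + r")\b"
)


def recognise(parts):
    """parts maps a text part (e.g. 'title', 'body') to its text."""
    entities = defaultdict(lambda: {"type": None, "count": 0, "parts": set()})
    for part, text in parts.items():
        for match in _PATTERN.finditer(text):
            entry = entities[match.group()]
            entry["type"] = WORD_LIST[match.group()]
            entry["count"] += 1
            entry["parts"].add(part)
    return dict(entities)


article = {
    "title": "AfD tagt in Berlin",
    "body": "Angela Merkel besucht Berlin. Berlin feiert.",
}
for name, info in recognise(article).items():
    print(name, info["type"], info["count"], sorted(info["parts"]))
```

A matcher like this only finds names that are in the list, which is why the slides pair it with a statistical recognizer for unknown entities.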

Structure
Topic/S Workflow
– Overview
– Pre-Processing
– Post-Processing
Demo
Conclusion

Semantic Model

Storage of Semantic Data
Benchmark of triple stores [Voigt2012]
No existing benchmark found that covers real-world data, inference, SPARQL 1.1, and multi-client access
What have we done?
4 datasets, 5 stores, 15 queries per dataset
Measured loading time, memory requirement, per-query-type and multi-client performance
Result
No clear recommendation; the choice strongly depends on project requirements
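
A multi-client query measurement of the kind listed above could be structured as follows. This is a generic sketch, not the harness from [Voigt2012]: `run_query` is a hypothetical callable that sends one query to a store, and `fake_store` merely simulates latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def timed(run_query, query):
    """Return the wall-clock seconds one query execution takes."""
    start = time.perf_counter()
    run_query(query)
    return time.perf_counter() - start


def benchmark(run_query, queries, clients=1, repetitions=3):
    """Mean latency per query when `clients` threads issue it concurrently."""
    results = {}
    with ThreadPoolExecutor(max_workers=clients) as pool:
        for name, query in queries.items():
            futures = [
                pool.submit(timed, run_query, query)
                for _ in range(clients * repetitions)
            ]
            results[name] = mean(f.result() for f in futures)
    return results


def fake_store(query):
    # Stand-in for a real triple store client; sleeps instead of querying.
    time.sleep(0.001)


QUERIES = {"q1": "SELECT * WHERE { ?s ?p ?o } LIMIT 10"}
print(benchmark(fake_store, QUERIES, clients=4))
```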

Storage of Semantic Data
Using Oracle 11gR2
Pros
Already available, existing knowledge
Nearly as fast as Virtuoso etc.
Integrated querying of relational and semantic data
Spatial data mining features
Cons
Inference
Incomplete SPARQL 1.1 support
Limited custom rule support

Semantic Facts
Named entities required, but no lists available
Manual search, extraction, and cleaning of named entities from YAGO2, Freebase, JRC_Names, Tagesspiegel, DBpedia
Stored preferred and alternative names
ID: http://www.topic-s.de/topics-facts/id/person/Rene_Muller
Names: Rene Muller, Rene Müller, René Muller, René Müller
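
The four names in the example differ only in their diacritics, so a lookup from any spelling to the entity ID can be sketched as below. Generating the variants by accent stripping is an assumption made for illustration (the slides describe manual search and cleaning); `name_variants` and `NAME_INDEX` are hypothetical names.

```python
import unicodedata
from itertools import product


def strip_accents(text):
    """Remove combining marks, e.g. 'é' -> 'e' and 'ü' -> 'u'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def name_variants(preferred):
    """All combinations of accented and unaccented spellings per name part."""
    options = [sorted({part, strip_accents(part)}) for part in preferred.split()]
    return {" ".join(combo) for combo in product(*options)}


ENTITY_ID = "http://www.topic-s.de/topics-facts/id/person/Rene_Muller"

# Alternative name -> entity ID, so every spelling resolves to the same entity.
NAME_INDEX = {variant: ENTITY_ID for variant in name_variants("René Müller")}

print(sorted(NAME_INDEX))
```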

Semantic Facts
BUT named entities alone produce poor topics; keywords are required as well, e.g.,
Waffenstillstand (cease-fire), Meister (champion), Klimaschutz (climate protection), …
Some numbers
Triples without SemItems: 10.3 million
SemItem        Number    (with alt. names)
Person         590,828   (860,594)
Organization   63,262    (98,052)
Place          89,672    (95,146)
Keyword        1,329

Structure
Topic/S Workflow
– Overview
– Pre-Processing
– Post-Processing
Demo
Conclusion

Workflow: Postprocessor
Clustering

Clustering
[Figure: example SemItems being clustered: Merkel, Politics, Highway, Traffic, Audi, Obama.]
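
The slides do not name the clustering algorithm. As one possible reading, articles can be grouped when their SemItem sets overlap strongly; the sketch below uses Jaccard similarity with single-link grouping via union-find. The function names, the threshold of 0.5, and the article IDs are assumptions.

```python
def jaccard(a, b):
    """Overlap of two SemItem sets, between 0.0 and 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_articles(articles, threshold=0.5):
    """Group article IDs whose SemItem sets are similar (single link)."""
    ids = list(articles)
    parent = {i: i for i in ids}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for index, first in enumerate(ids):
        for second in ids[index + 1:]:
            if jaccard(articles[first], articles[second]) >= threshold:
                parent[find(first)] = find(second)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values(), key=len, reverse=True)


# Hypothetical articles built from the SemItems shown in the figure.
ARTICLES = {
    "a1": {"Merkel", "Politics", "Obama"},
    "a2": {"Merkel", "Politics"},
    "a3": {"Highway", "Traffic", "Audi"},
    "a4": {"Highway", "Traffic"},
}
print(cluster_articles(ARTICLES))
```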

Clustering (Top Cluster 06.06.2013)
Articles | First date | Name | Hot topic
7 | 6.6. | "Bürgermeister", "Gemeinde", "Gemeinderat", "Kosten" | No
4 | 6.6. | "Abzug", "Bürgerkrieg", "Grenze", "Soldat", "Österreich", "Syrien", "Tel Aviv", "Vereinten Nationen" | Yes
3 | 6.6. | "Vertrag", "Vorstandschef", "München", "FC Bayern München", "FC Bayern München AG", "Olympique Marseille", "Daniel Van Buyten", "Franck Ribery", "Karl-Heinz Rummenigge" | Yes
2 | 4.6. | "Ministerpräsident", "Protest", "Istanbul", "Tunis", "Recep Tayyip Erdogan" | Yes

Topic trend
Date | Articles | SemItems
4.6. | 6 | "Demonstrant", "Ministerpräsident", "Protest", "Regierung", "Stadtteil", "Istanbul", "Recep Tayyip Erdogan"
5.6. | 14 | "Demonstrant", "Protest", "Istanbul", "Recep Tayyip Erdogan"
6.6. | 2 | "Ministerpräsident", "Protest", "Istanbul", "Tunis", "Recep Tayyip Erdogan"
7.6. | 9 | "Demonstrant", "Protest", "Recep Tayyip Erdogan"
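
One way to follow a topic across days is to link each day's best-matching cluster to the previous day's SemItems. The slides do not state how Topic/S does this; the Jaccard matching, the 0.3 threshold, and the name `topic_trend` are assumptions. The data comes from the two tables above.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def topic_trend(days, seed, threshold=0.3):
    """days: chronological list of (date, [(semitems, article_count), ...])."""
    trend, current = [], set(seed)
    for date, clusters in days:
        if not clusters:
            continue
        items, count = max(clusters, key=lambda c: jaccard(current, c[0]))
        if jaccard(current, items) >= threshold:
            trend.append((date, count))
            current = set(items)  # follow the topic as its SemItems drift
    return trend


DAYS = [
    ("4.6.", [({"Demonstrant", "Ministerpräsident", "Protest", "Regierung",
                "Stadtteil", "Istanbul", "Recep Tayyip Erdogan"}, 6)]),
    ("5.6.", [({"Demonstrant", "Protest", "Istanbul", "Recep Tayyip Erdogan"}, 14)]),
    ("6.6.", [({"Bürgermeister", "Gemeinde", "Gemeinderat", "Kosten"}, 7),
              ({"Ministerpräsident", "Protest", "Istanbul", "Tunis",
                "Recep Tayyip Erdogan"}, 2)]),
    ("7.6.", [({"Demonstrant", "Protest", "Recep Tayyip Erdogan"}, 9)]),
]

print(topic_trend(DAYS, seed=DAYS[0][1][0][0]))
```

The resulting series of article counts per day is the raw material for a trend signal; how a "hot topic" is derived from it is not specified in the slides.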

Structure
Topic/S Workflow
– Overview
– Pre-Processing
– Post-Processing
Demo
Conclusion

Workflow: Related Article
• Person
• Location
• Organisation
• Keywords

Related Article - relatedness
• Computes the topic-based difference between articles
• Detects the main entities in articles
• Recommends navigation targets to the user

Related Article - relatedness
[Figure: example articles annotated with entities and their occurrence counts. Entities shown: Bernd Lucke, Berlin, AfD, Klaus Wowereit. Occurrence values shown: 1 + 4, 0 + 1, 0 + 4, 0 + 5, 1 + 3, 1 + 4.]
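
Relatedness between two articles could be computed from their shared entities, for example as a cosine similarity over weighted occurrence counts. Everything below is an assumption for illustration: the reading of "a + b" as title plus body occurrences, the `TITLE_WEIGHT` of 3, the assignment of counts to entities, and the function names. The slides do not specify the actual formula.

```python
import math

TITLE_WEIGHT = 3.0  # hypothetical: a mention in the title counts three times


def weights(article):
    """article maps entity -> (title_count, body_count)."""
    return {e: TITLE_WEIGHT * t + b for e, (t, b) in article.items()}


def main_entity(article):
    """Entity with the highest weighted occurrence."""
    w = weights(article)
    return max(w, key=w.get)


def relatedness(a, b):
    """Cosine similarity of the weighted entity vectors (0.0 to 1.0)."""
    wa, wb = weights(a), weights(b)
    dot = sum(wa[e] * wb[e] for e in wa.keys() & wb.keys())
    norm = math.sqrt(sum(v * v for v in wa.values())) * math.sqrt(
        sum(v * v for v in wb.values())
    )
    return dot / norm if norm else 0.0


# Illustrative articles; the count-to-entity mapping is not from the slides.
article_a = {"Bernd Lucke": (1, 4), "Berlin": (0, 1)}
article_b = {"Bernd Lucke": (0, 4), "AfD": (0, 5), "Berlin": (0, 1)}
article_c = {"Klaus Wowereit": (1, 3), "Berlin": (1, 4)}

print(main_entity(article_a))
print(relatedness(article_a, article_b) > relatedness(article_a, article_c))
```

Under these assumptions, two articles that share a main entity score higher than two that only share a place name, which matches the goal of topic-based navigation recommendations.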

Structure
Topic/S Workflow
Demo
Conclusion

Live Demo

Structure
Topic/S Workflow
Demo
User Interfaces
Disambiguation
Conclusion

Static User Interface

Dynamic User Interface

Structure
Topic/S Workflow
Demo
User Interfaces
Disambiguation
Conclusion

Disambiguation
Source: fansshare.com
Source: lounge.espdisk.com
Source: de.wikipedia.org

Disambiguation
Problem: not all SemItems available in the LOD
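[Figure: the mention "Michael Jackson" with the context term "Beer" is compared against internal facts and external facts (DBpedia, etc.). Candidates: "Michael Jackson" (Beer, Whiskey) and "Michael Jackson" (Music, King of Pop). Result: identification of the entity cluster.]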
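
The idea in the figure, choosing between entities with the same name by comparing the article's context with facts about each candidate, can be sketched as a simple overlap score. The candidate labels, the fact sets, and the function `disambiguate` are hypothetical; the slides do not give the actual scoring.

```python
# Hypothetical candidates for the same surface name, each with known facts.
CANDIDATES = {
    "Michael Jackson #1": {"beer", "whiskey"},
    "Michael Jackson #2": {"music", "king of pop"},
}


def disambiguate(context_terms, candidates=CANDIDATES):
    """Pick the candidate sharing the most terms with the context.

    Returns None when no candidate matches or the best score is tied.
    """
    context = {term.lower() for term in context_terms}
    scored = sorted(
        ((len(context & facts), name) for name, facts in candidates.items()),
        reverse=True,
    )
    best_score, best_name = scored[0]
    tied = len(scored) > 1 and scored[1][0] == best_score
    if best_score == 0 or tied:
        return None
    return best_name


print(disambiguate(["Beer", "Brewery"]))
```

Returning `None` on a tie or an empty overlap leaves the decision open, which fits the stated problem that not all SemItems are available in the LOD.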

Structure
Topic/S Workflow
Demo
Conclusion

Sum it up!
Result
Identifying topics and pushing them to the editor
Lessons learned
NER: weak for non-English text, combination of approaches required
Model needs to be optimized for queries
Dedicated user interface required
Outlook
Prediction of topics with causal/temporal relations
Source: ooltapulta.com
Source: business-strategy-innovation.com

Thanks! Questions?
