4. Motivation
fink & PARTNER Media Services GmbH
Media management for publishing houses
Some customers
Chair of Multimedia Technology, TU Dresden
Research fields
Adaptive, composite Rich Internet Applications
Semantic document life cycle management
Friday, 14.06.2013 Topic/S Slide 3
6. Problem
Overwhelming amount of data
e.g., Mainpost 2000 articles/day from agencies
and in-house production
[Figure: data sources. News agencies (DPA, Reuters, KNA); Web, social media (Twitter, Facebook, Blogs, …); …; In-house production (Archive, Online)]
8. Problem
Hard to identify topics
Browsing
Keyword identification
And their relations, media, and trends
Source: Zeit.de
9. Vision
Automatic topic discovery using Named Entities and
other keywords (Semantic Items, SemItem)
Investigation of trending topics
Push them to the editor
[Figure: Pre-processing links media assets (MA1, MA2) to named entities (E1 to E7); post-processing groups the named entities into topics (T1 to T3)]
10. Requirements
Extraction and disambiguation
of (German) SemItems
Model and storage of semantic
information
Topic and trend discovery
Scalable architecture for
business use case
11. Structure
Motivation, Problems, and Goals
Topic/S Workflow
– Overview
– Pre-Processing
– Semantic Model, Facts, and Storage
– Post-Processing
– Search and User Interface
Demo
Current and Upcoming Tasks
Conclusion
14. Workflow: Preprocessor
Language Recognition
Based on article content
Supports German and English
Rule-based solution:
– Share of capitalized words (en 18% vs. de 43%)
– Occurrence of umlauts (ä, ö, ü)
– Presence of language-specific words
• en: of, to, and, a, for, the, that
• de: der, das, und, sich, auf
Precision: 99%
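The three rules above can be sketched as a small scoring function. This is only an illustration: the slide names the rule types and the marker words, while the scoring weights and the 30% capitalization threshold are assumptions chosen for the example, not the values used in Topic/S.

```python
import re

# Marker words as listed on the slide.
EN_WORDS = {"of", "to", "and", "a", "for", "the", "that"}
DE_WORDS = {"der", "das", "und", "sich", "auf"}

# Assumed threshold between the reported shares (en 18% vs. de 43%).
CAPITALIZED_SHARE_THRESHOLD = 0.30


def detect_language(text: str) -> str:
    """Return 'de', 'en', or 'unknown' using three simple rules."""
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of Unicode letters
    if not tokens:
        return "unknown"

    lowered = [t.lower() for t in tokens]
    en_score = sum(t in EN_WORDS for t in lowered)
    de_score = sum(t in DE_WORDS for t in lowered)

    # Rule: umlauts count as evidence for German (weight 2 is an assumption).
    if re.search(r"[äöüÄÖÜ]", text):
        de_score += 2

    # Rule: German capitalizes nouns, so its share of capitalized words is higher.
    capitalized_share = sum(t[0].isupper() for t in tokens) / len(tokens)
    if capitalized_share > CAPITALIZED_SHARE_THRESHOLD:
        de_score += 1
    else:
        en_score += 1

    if de_score == en_score:
        return "unknown"
    return "de" if de_score > en_score else "en"
```

A call such as `detect_language(article_text)` would then decide which language-specific processing steps run next.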
Source: onelanguageoneposter.com
15. Workflow: Preprocessor
Keywords
Lemmatization
Developing a word list
Extraction using the word list
Bonus: frequent terms of an article
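A minimal sketch of the keyword step, assuming a lookup-table lemmatizer and a tiny word list. The `LEMMAS` table, the `WORD_LIST` contents, and the function names are hypothetical placeholders; a real pipeline would use a proper German lemmatizer and a curated list.

```python
import re
from collections import Counter

# Hypothetical lemma table: inflected form -> base form.
LEMMAS = {"waffenstillstände": "waffenstillstand", "meisters": "meister"}

# Hypothetical word list of accepted keywords (lemmatized, lower case).
WORD_LIST = {"waffenstillstand", "meister", "klimaschutz"}


def lemmatize(token: str) -> str:
    """Map a lower-cased token to its lemma, falling back to the token itself."""
    return LEMMAS.get(token, token)


def extract_keywords(text: str, top_n: int = 3):
    """Return (keywords found in the word list, most frequent terms)."""
    tokens = [lemmatize(t.lower()) for t in re.findall(r"[^\W\d_]+", text)]
    counts = Counter(tokens)
    keywords = sorted(t for t in counts if t in WORD_LIST)
    # The "bonus" output: frequent terms, here without any stop-word filtering.
    frequent = [term for term, _ in counts.most_common(top_n)]
    return keywords, frequent
```

Lemmatizing before the lookup is what lets an inflected form such as "Waffenstillstände" match the list entry "waffenstillstand".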
Source: hugdaily.org
16. Workflow: Preprocessor
Categorisation
Classification of text
One categorizer per news-agency
IPTC categories
Categories useful for identifying topics
20. Workflow: Preprocessor
Categorisation - Quality

News agency                        Accuracy
KNA                                80.3%
DPA                                94.4%
EPD                                80.3%
Reuters                            90.8%
OTS                                93.5%
AFP                                86%

Method                             Accuracy
One categorizer for all agencies   85%
One categorizer per agency         87.5%
21. Workflow: Preprocessor
Named Entity Recognition
Recognition of persons,
organizations, places
Two methods: word list and statistical model
Additional information:
– occurrence count
– part of the text in which the NE appeared
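The word-list method together with the two pieces of additional information can be sketched as follows. The `GAZETTEER` entries, the part names, and the `Mention` structure are hypothetical; the actual system uses LingPipe with an extension rather than this regular-expression lookup.

```python
import re
from dataclasses import dataclass, field

# Hypothetical word list: surface form -> entity type.
GAZETTEER = {
    "Angela Merkel": "person",
    "Dresden": "place",
    "Reuters": "organization",
}


@dataclass
class Mention:
    name: str
    kind: str
    count: int = 0                            # occurrence count
    parts: set = field(default_factory=set)   # text parts the entity appeared in


def recognize(article: dict) -> dict:
    """article maps a part name (e.g. 'headline', 'body') to its text."""
    found = {}
    for part, text in article.items():
        for name, kind in GAZETTEER.items():
            hits = len(re.findall(r"\b" + re.escape(name) + r"\b", text))
            if hits:
                mention = found.setdefault(name, Mention(name, kind))
                mention.count += hits
                mention.parts.add(part)
    return found
```

Keeping the text part alongside the count allows later steps to weight, for example, a headline mention differently from a body mention.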
Source: churchthought.com
22. Workflow: Preprocessor
Named Entity Recognition – Approaches
word list
Tool: LingPipe + Extension
Sources: LOD (DBpedia, GeoNames, YAGO2)
Advantages: controlled vocabulary,
guaranteed recognition of entities
statistics
Tool: Stanford NLP
Source: pre-trained model
Advantage: Recognition of unknown entities
24. Semantic Model
Requirements
information life cycle | simple | fast querying |
schema reuse | inference | ...
Foundations
SNaP Ontologies, IPTC NewsCodes, W3C Ontology
for Media Resources, schema.org
RDFS, less OWL
Conventions, versioning, and documentation
26. Storage of Semantic Data
Benchmark of triple stores [Voigt2012]
No benchmark found with real-world data, inference,
SPARQL 1.1, and multi-client
What have we done?
4 datasets, 5 stores, 15 queries per dataset
Loading time, memory requirement, per-query
type & multi-client performance
Result
No clear recommendation, strongly depends on
project requirements
27. Storage of Semantic Data
Using Oracle 11gR2
Pros
Already available, existing knowledge
Nearly as fast as Virtuoso etc.
Integrated querying of relational and
semantic data
Spatial data mining features
Cons
Inference
Incomplete SPARQL 1.1 support
Limited custom rule support
28. Semantic Facts
Named Entities required but no lists available
Manual search, extraction, and
cleaning of named entities from
YAGO2, Freebase, JRC_Names,
Tagesspiegel, DBpedia
Preferred and alternative names are stored
ID: http://www.topic-s.de/topics-facts/id/person/Rene_Muller
Names: Rene Muller, Rene Müller, René Muller, René Müller
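One way to use the stored names is a lookup index from every name variant to the entity ID, sketched below with the example from the slide. The `FACTS` dictionary layout and the function name are assumptions for illustration; in Topic/S the names live in the triple store.

```python
# Entity ID -> stored names (treating the first as preferred is an assumption).
FACTS = {
    "http://www.topic-s.de/topics-facts/id/person/Rene_Muller": [
        "Rene Muller",
        "Rene Müller",
        "René Muller",
        "René Müller",
    ],
}


def build_name_index(facts: dict) -> dict:
    """Map each preferred or alternative name to the IDs that carry it."""
    index = {}
    for entity_id, names in facts.items():
        for name in names:
            index.setdefault(name, []).append(entity_id)
    return index
```

The index maps to a list of IDs because several entities may share one name, in which case a disambiguation step has to pick among them.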
29. Semantic Facts
But named entities alone yield poor topics;
keywords are required as well, e.g.,
Waffenstillstand (cease-fire), Meister
(champion), Klimaschutz (climate protection), …
Some numbers
Triples without SemItems: 10.3 million

SemItem        Number    (with alt. names)
Person         590,828   (860,594)
Organization   63,262    (98,052)
Place          89,672    (95,146)
Keyword        1,329
38. Workflow: Related Article
Related Article - Relatedness
• Computes the topic-based difference between
articles
• Detects the main entities in articles
• Provides navigation recommendations for the user
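The slide does not say how the topic-based difference is computed. As one possible illustration, the sketch below uses the Jaccard distance between the sets of SemItems attached to two articles; the measure, the `max_distance` cut-off, and the function names are all assumptions.

```python
def topic_distance(items_a: set, items_b: set) -> float:
    """Jaccard distance between two SemItem sets: 0.0 = identical, 1.0 = disjoint."""
    if not items_a and not items_b:
        return 0.0
    return 1.0 - len(items_a & items_b) / len(items_a | items_b)


def related_articles(target_id: str, articles: dict, max_distance: float = 0.7):
    """Return (article id, distance) pairs close to the target, nearest first.

    articles maps an article id to the set of SemItems found in it.
    """
    target = articles[target_id]
    scored = [
        (other_id, topic_distance(target, items))
        for other_id, items in articles.items()
        if other_id != target_id
    ]
    close = [pair for pair in scored if pair[1] <= max_distance]
    return sorted(close, key=lambda pair: pair[1])
```

The sorted result could feed the navigation recommendation directly, for instance as a "related articles" list next to the article being read.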
47. Disambiguation
Problem: not all SemItems available in the LOD
[Figure: identification of the entity cluster for "Michael Jackson". Internal facts (Michael Jackson: Beer) are compared with external facts from DBpedia, etc. (Michael Jackson: Beer, Whiskey; Michael Jackson: Music, King of Pop)]
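A minimal sketch of disambiguation by fact overlap: the candidate entity whose known facts share the most terms with the context of the mention wins. The candidate labels "(writer)" and "(singer)", the fact sets, and the overlap rule are hypothetical illustrations, not the clustering used in Topic/S.

```python
# Hypothetical candidates for one ambiguous name, each with associated fact terms.
CANDIDATES = {
    "Michael Jackson (writer)": {"beer", "whiskey"},
    "Michael Jackson (singer)": {"music", "king of pop"},
}


def disambiguate(context_terms: set, candidates: dict):
    """Return the candidate with the largest fact overlap, or None if nothing matches."""
    best_entity, best_overlap = None, 0
    for entity, facts in candidates.items():
        overlap = len(context_terms & facts)
        if overlap > best_overlap:
            best_entity, best_overlap = entity, overlap
    return best_entity
```

Returning `None` when no fact overlaps leaves room for the case named on the slide, where a SemItem is not available in the LOD and has to be covered by internal facts.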
49. Sum it up!
Result
Identifying topics and pushing them
to the editor
Lessons learned
NER: weak for non-English text,
combination of methods required
Model needs to be optimized for
queries
Dedicated user interface required
Outlook
prediction of topics with
causal/temporal relations
Source: ooltapulta.com
Source: business-strategy-innovation.com