Sächsische AufbauBank
Research and Development - Project Funding
Project number - 99457/2677
Martin Voigt, Michael Aleythe, Peter Wehner
Structure
Motivation, Problems, and Goals
Topic/S Workflow
Demo
Current and Upcoming Task
Conclusion
Friday, 14.06.2013 Topic/S Slide 1
Motivation
fink & PARTNER Media Services GmbH
Media management for publishing houses
Some customers
Chair of Multimedia Technology, TU Dresden
Research fields
Adaptive, composite Rich Internet Applications
Semantic document life cycle management
Friday, 14.06.2013 Topic/S Slide 3
Motivation
Newsroom
Friday, 14.06.2013 Topic/S Slide 4
Source: ringier.com
Problem
Overwhelming amount of data
e.g., Mainpost: 2,000 articles/day from agencies
and in-house production
News agencies: DPA, Reuters, KNA
Web, social media: Twitter, Facebook, Blogs, …
…
In-house production
Archive
Online
Friday, 14.06.2013 Topic/S Slide 5
Problem
Friday, 14.06.2013 Topic/S Slide 6
Problem
Hard to identify topics
Browsing
Keyword identification
And their relations, media, and trends
Friday, 14.06.2013 Topic/S Slide 7
Source: Zeit.de
Vision
Automatic topic discovery using Named Entities and
other keywords (Semantic Items, SemItem)
Investigation of trending topics
Push them to the editor
[Figure: pre-processing links media assets (MA1, MA2) to named entities (E1 to E7); post-processing groups the entities into topics (T1 to T3).]
Friday, 14.06.2013 Topic/S Slide 8
Requirements
Extraction and disambiguation
of (German) SemItems
Model and storage of semantic
information
Topic and trend discovery
Scalable architecture for
business use case
Friday, 14.06.2013 Topic/S Slide 9
Structure
Motivation, Problems, and Goals
Topic/S Workflow
– Overview
– Pre-Processing
– Semantic Model, Facts, and Storage
– Post-Processing
– Search and User Interface
Demo
Current and Upcoming Task
Conclusion
Friday, 14.06.2013 Topic/S Slide 10
Workflow
Friday, 14.06.2013 Topic/S Slide 11
Workflow: Preprocessor
Friday, 14.06.2013 Topic/S
Language Recognition
Based on article content
Supports German and English
Rule-based solution:
– Share of capitalized words (en 18% vs. de 43%)
– Occurrence of umlauts (ä, ö, ü)
– Presence of language-specific words
• en: of, to, and, a, for, the, that
• de: der, das, und, sich, auf
Precision: 99%
Slide 13
Source: onelanguageoneposter.com
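The three rules on this slide can be sketched as a small voting scheme. This is only an illustration: the word lists are the ones shown on the slide, but the 0.30 capitalization threshold, the one-vote-per-rule scoring, and the function name `detect_language` are assumptions, not the project's implementation.

```python
import re

# Function-word lists taken from the slide; thresholds and scoring are assumptions.
EN_WORDS = {"of", "to", "and", "a", "for", "the", "that"}
DE_WORDS = {"der", "das", "und", "sich", "auf"}


def detect_language(text: str) -> str:
    """Return "de" or "en" using three simple rule-based votes."""
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of Unicode letters
    if not tokens:
        return "en"  # arbitrary default for empty input

    score = 0  # positive means German, negative means English

    # Rule 1: share of capitalized words (slide: en 18% vs. de 43%).
    capitalized = sum(1 for t in tokens if t[0].isupper()) / len(tokens)
    score += 1 if capitalized > 0.30 else -1  # 0.30 is an assumed midpoint

    # Rule 2: umlauts count as evidence for German.
    if re.search(r"[äöüÄÖÜ]", text):
        score += 1

    # Rule 3: language-specific function words.
    lowered = [t.lower() for t in tokens]
    de_hits = sum(1 for t in lowered if t in DE_WORDS)
    en_hits = sum(1 for t in lowered if t in EN_WORDS)
    if de_hits > en_hits:
        score += 1
    elif en_hits > de_hits:
        score -= 1

    return "de" if score > 0 else "en"
```

Calling `detect_language` on a full article body rather than a headline should give the capitalization rule enough tokens to be meaningful.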
Workflow: Preprocessor
Friday, 14.06.2013 Topic/S
Keywords
Lemmatization
Developing a word list
Extraction using the word list
Bonus: frequent terms of an article
Slide 14
Source: hugdaily.org
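A minimal sketch of the keyword step: lemmatize, match against a word list, and report the most frequent terms as a bonus. The tiny `LEMMAS` table and `KEYWORD_LIST` are hypothetical stand-ins; the slides do not show the project's lemmatizer or its actual word list.

```python
import re
from collections import Counter

# Hypothetical miniature resources, for illustration only.
LEMMAS = {"proteste": "protest", "protesten": "protest", "verträge": "vertrag"}
KEYWORD_LIST = {"protest", "vertrag", "waffenstillstand"}


def lemmatize(token: str) -> str:
    """Map a token to its lemma via a lookup table, falling back to lowercase."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)


def extract_keywords(text: str, top_n: int = 3):
    """Return (word-list keywords with counts, most frequent lemmas)."""
    lemmas = [lemmatize(t) for t in re.findall(r"[^\W\d_]+", text)]
    counts = Counter(lemmas)
    keywords = {lemma: n for lemma, n in counts.items() if lemma in KEYWORD_LIST}
    frequent = [lemma for lemma, _ in counts.most_common(top_n)]
    return keywords, frequent
```

In practice the frequent-term list would need stop-word filtering, which is left out here to keep the sketch short.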
Workflow: Preprocessor
Friday, 14.06.2013 Topic/S
Categorisation
Classification of text
One categorizer per news-agency
IPTC categories
Categories useful for identifying topics
Slide 15
Workflow: Preprocessor
Friday, 14.06.2013 Topic/S
Categorisation - Training
[Figure: an article and its IPTC Media Topic (e.g., Politics) are used to train the categoriser.]
Slide 16
Workflow: Preprocessor
Friday, 14.06.2013 Topic/S
Categorisation - Training
[Figure: articles from OTS, Reuters, and DPA, each with its IPTC Media Topic (e.g., Politics), train a separate categoriser per agency.]
Slide 17
Workflow: Preprocessor
Friday, 14.06.2013 Topic/S
Categorisation
[Figure: a DPA article is passed to the DPA categoriser (not the OTS or Reuters one), which assigns an IPTC Media Topic such as Politics.]
Slide 18
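The "one categoriser per news agency" idea can be sketched as a registry keyed by agency. The `KeywordCategoriser` below is a deliberately naive word-overlap classifier chosen for brevity; the slides do not say which classification method the project used, and the training texts are made up.

```python
from collections import Counter, defaultdict


class KeywordCategoriser:
    """Toy categoriser: picks the category whose training words overlap most."""

    def __init__(self):
        self.vocab = defaultdict(Counter)  # category -> word counts

    def train(self, text, category):
        self.vocab[category].update(text.lower().split())

    def predict(self, text):
        words = text.lower().split()
        return max(self.vocab, key=lambda c: sum(self.vocab[c][w] for w in words))


categorisers = {}  # agency name -> its own categoriser


def train(agency, text, iptc_topic):
    categorisers.setdefault(agency, KeywordCategoriser()).train(text, iptc_topic)


def categorise(agency, text):
    # Route the article to the categoriser trained on that agency's articles.
    if agency not in categorisers:
        raise KeyError(f"no categoriser trained for agency {agency!r}")
    return categorisers[agency].predict(text)


# Invented training examples, labelled with IPTC Media Topic names.
train("DPA", "Kanzlerin Regierung Bundestag Wahl", "politics")
train("DPA", "Tor Trainer Bundesliga Spiel", "sport")
train("Reuters", "government election parliament", "politics")
```

Keeping the models separate lets each one adapt to an agency's own vocabulary and style, at the cost of needing labelled training data per agency.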
Workflow: Preprocessor
Friday, 14.06.2013 Topic/S
Categorisation - Quality
News agency | Accuracy
KNA | 80.3 %
DPA | 94.4 %
EPD | 80.3 %
Reuters | 90.8 %
OTS | 93.5 %
AFP | 86 %
Method | Accuracy
One categoriser for all agencies | 85 %
One categoriser per agency | 87.5 %
Slide 19
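The slide does not say how the 87.5 % for "one categoriser per agency" was derived. As a quick plausibility check, the unweighted mean of the six per-agency accuracies comes out at about 87.55, which is close to the reported figure; whether the authors averaged this way (or weighted by article volume) is an assumption.

```python
# Per-agency accuracies (percent) as listed on the slide.
accuracy = {"KNA": 80.3, "DPA": 94.4, "EPD": 80.3, "Reuters": 90.8, "OTS": 93.5, "AFP": 86.0}

# Unweighted mean over agencies; the weighting is an assumption.
mean_accuracy = sum(accuracy.values()) / len(accuracy)
print(f"{mean_accuracy:.2f}")
```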
Workflow: Preprocessor
Friday, 14.06.2013 Topic/S
Named Entity Recognition
Recognition of persons,
organizations, places
two methods: word list, statistics
additional information:
– occurrence count
– text part NE appeared in
Slide 20
Source: churchthought.com
Workflow: Preprocessor
Friday, 14.06.2013 Topic/S
Named Entity Recognition – Approaches
word list
Tool: LingPipe + Extension
Sources: LOD (DBPedia, Geonames, YAGO2)
Advantages: controlled vocabulary,
guaranteed recognition of entities
statistics
Tool: Stanford NLP
Source: pre-trained model
Advantage: Recognition of unknown entities
Slide 21
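A sketch of how the word-list approach and its extra information (occurrence count, text part) might look, plus one possible way to combine it with a statistical recognizer. This is plain Python and does not use the LingPipe or Stanford NLP APIs; the `GAZETTEER` entries, the result layout, and the "gazetteer wins" merge policy are all assumptions.

```python
import re
from collections import Counter

# Hypothetical gazetteer; the project built its lists from LOD sources.
GAZETTEER = {
    "Recep Tayyip Erdogan": "person",
    "Istanbul": "place",
    "FC Bayern München": "organization",
}


def gazetteer_ner(parts: dict) -> dict:
    """Find gazetteer entries in each text part and count occurrences.

    `parts` maps a text-part name (e.g. "headline", "body") to its text.
    Returns {entity: {"type": ..., "counts": Counter per text part}}.
    """
    found = {}
    for part, text in parts.items():
        for name, etype in GAZETTEER.items():
            # Match whole names only, not substrings of longer words.
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            hits = len(re.findall(pattern, text))
            if hits:
                entry = found.setdefault(name, {"type": etype, "counts": Counter()})
                entry["counts"][part] += hits
    return found


def merge(gazetteer_hits: dict, statistical_hits: dict) -> dict:
    """Keep all gazetteer entities; add statistical ones only if unknown."""
    merged = dict(statistical_hits)
    merged.update(gazetteer_hits)  # controlled vocabulary takes precedence
    return merged
```

The merge keeps the controlled vocabulary authoritative while still letting the statistical model contribute entities that are missing from the lists.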
Structure
Motivation, Problems, and Goals
Topic/S Workflow
– Overview
– Pre-Processing
– Semantic Model, Facts, and Storage
– Post-Processing
– Search and User Interface
Demo
Current and Upcoming Task
Conclusion
Friday, 14.06.2013 Topic/S Slide 22
Semantic Model
Requirements
information life cycle | simple | fast querying |
schema reuse | inference | ...
Foundations
SNaP Ontologies, IPTC NewsCodes, W3C Ontology
for Media Resources, schema.org
RDFS, less OWL
Conventions, versioning, and documentation
Friday, 14.06.2013 Topic/S Slide 23
Semantic Model
Friday, 14.06.2013 Topic/S Slide 24
Storage of Semantic Data
Benchmark of triple stores [Voigt2012]
No benchmark found with real-world data, inference,
SPARQL 1.1, and multi-client
What have we done?
4 datasets, 5 stores, 15 queries per dataset
Loading time, memory requirement, per-query
type & multi-client performance
Result
No clear recommendation, strongly depends on
project requirements
Friday, 14.06.2013 Topic/S Slide 25
Storage of Semantic Data
Using Oracle 11gR2
Pros
Already available, existing knowledge
Nearly as fast as Virtuoso etc.
Integrated querying of relational and
semantic data
Spatial data mining features
Cons
Inference
Incomplete SPARQL 1.1 support
Limited custom rule support
Friday, 14.06.2013 Topic/S Slide 26
Semantic Facts
Named entities are required, but no ready-made lists were available
Manual search, extraction, and
cleaning of named entities from
YAGO2, Freebase, JRC_Names,
Tagesspiegel, DBpedia
Stored preferred and alternative names
ID: http://www.topic-s.de/topics-facts/id/person/Rene_Muller
Names: Rene Muller, Rene Müller, René Muller, René Müller
Friday, 14.06.2013 Topic/S Slide 27
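Storing a preferred name together with alternative spellings suggests a lookup from any surface form to one ID. A minimal sketch, assuming diacritics are folded away at lookup time; the `NameIndex` class and `fold` helper are hypothetical and not part of the project.

```python
import unicodedata


def fold(name: str) -> str:
    """Lowercase a name, strip diacritics (é -> e, ü -> u), normalize spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.lower().split())


class NameIndex:
    """Maps preferred and alternative names to a single entity ID."""

    def __init__(self):
        self._by_name = {}

    def add(self, entity_id, preferred, alternatives=()):
        for name in (preferred, *alternatives):
            self._by_name[fold(name)] = entity_id

    def lookup(self, surface_form):
        return self._by_name.get(fold(surface_form))


index = NameIndex()
index.add(
    "http://www.topic-s.de/topics-facts/id/person/Rene_Muller",
    "Rene Muller",
    ["Rene Müller", "René Muller", "René Müller"],
)
```

Folding makes all four spellings from the slide collapse to one key, but it can also conflate distinct people, and it does not cover the German "ue" transliteration of "ü"; storing explicit alternative names, as the slide describes, avoids relying on folding alone.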
Semantic Facts
BUT named entities alone produce poor topics, so keywords
are required as well, e.g.,
Waffenstillstand (cease-fire), Meister
(champion), Klimaschutz (climate protection), …
Some numbers
Triples without SemItems: 10.3 million
SemItem | Number | With alt. names
Person | 590,828 | 860,594
Organization | 63,262 | 98,052
Place | 89,672 | 95,146
Keyword | 1,329 |
Friday, 14.06.2013 Topic/S Slide 28
Structure
Motivation, Problems, and Goals
Topic/S Workflow
– Overview
– Pre-Processing
– Semantic Model, Facts, and Storage
– Post-Processing
– Search and User Interface
Demo
Current and Upcoming Task
Conclusion
Friday, 14.06.2013 Topic/S Slide 29
Workflow: Postprocessor
Friday, 14.06.2013 Topic/S
Clustering
Slide 30
Workflow: Postprocessor
Friday, 14.06.2013 Topic/S
Clustering
Slide 31
Workflow: Postprocessor
Friday, 14.06.2013 Topic/S
Clustering
Merkel
Politics
Highway
Traffic
Audi
Obama
Slide 32
Workflow: Postprocessor
Friday, 14.06.2013 Topic/S
Clustering (Top Cluster 06.06.2013)
Articles | First date | Name | Hot topic
7 | 6.6. | "Bürgermeister", "Gemeinde", "Gemeinderat", "Kosten" | No
4 | 6.6. | "Abzug", "Bürgerkrieg", "Grenze", "Soldat", "Österreich", "Syrien", "Tel Aviv", "Vereinten Nationen" | Yes
3 | 6.6. | "Vertrag", "Vorstandschef", "München", "FC Bayern München", "FC Bayern München AG", "Olympique Marseille", "Daniel Van Buyten", "Franck Ribery", "Karl-Heinz Rummenigge" | Yes
2 | 4.6. | "Ministerpräsident", "Protest", "Istanbul", "Tunis", "Recep Tayyip Erdogan" | Yes
Slide 33
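The slides do not name the clustering algorithm. As one plausible illustration, articles can be grouped by the overlap of their SemItem sets; the sketch below uses Jaccard similarity with a greedy single pass. The threshold of 0.3 and the function names are assumptions.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_articles(articles: dict, threshold: float = 0.3) -> list:
    """Greedy single-pass clustering over {article_id: set of SemItems}.

    Each cluster is {"ids": [...], "items": union of member SemItems}.
    """
    clusters = []
    for article_id, items in articles.items():
        best = max(clusters, key=lambda c: jaccard(c["items"], items), default=None)
        if best is not None and jaccard(best["items"], items) >= threshold:
            best["ids"].append(article_id)
            best["items"] |= items
        else:
            clusters.append({"ids": [article_id], "items": set(items)})
    return clusters


articles = {
    "a1": {"Protest", "Istanbul", "Recep Tayyip Erdogan"},
    "a2": {"Demonstrant", "Protest", "Istanbul", "Recep Tayyip Erdogan"},
    "a3": {"Bürgermeister", "Gemeinde", "Kosten"},
}
clusters = cluster_articles(articles)
```

The union of SemItems in a cluster then plays the role of the cluster "name" column in the table, and the number of member IDs corresponds to the article count.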
Workflow: Postprocessor
Friday, 14.06.2013 Topic/S
Topic trend
Date | Articles | SemItems
4.6. | 6 | "Demonstrant", "Ministerpräsident", "Protest", "Regierung", "Stadtteil", "Istanbul", "Recep Tayyip Erdogan"
5.6. | 14 | "Demonstrant", "Protest", "Istanbul", "Recep Tayyip Erdogan"
6.6. | 2 | "Ministerpräsident", "Protest", "Istanbul", "Tunis", "Recep Tayyip Erdogan"
7.6. | 9 | "Demonstrant", "Protest", "Recep Tayyip Erdogan"
Slide 34
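One simple way to read a trend out of such a series is to flag days on which the article count jumps relative to the previous day. The article counts below are the ones from the topic trend table (year 2013 taken from the slide date); the doubling factor and the function `rising_days` are assumptions, since the slides do not define how a "hot topic" is detected.

```python
from datetime import date

# Article counts per day for the Istanbul protest topic, from the slide.
daily_counts = {
    date(2013, 6, 4): 6,
    date(2013, 6, 5): 14,
    date(2013, 6, 6): 2,
    date(2013, 6, 7): 9,
}


def rising_days(counts: dict, factor: float = 2.0) -> list:
    """Return days whose count is at least `factor` times the previous day's."""
    days = sorted(counts)
    return [
        today
        for previous, today in zip(days, days[1:])
        if counts[previous] > 0 and counts[today] >= factor * counts[previous]
    ]


rising = rising_days(daily_counts)
```

With a factor of 2.0 this flags 5 June (6 to 14 articles) and 7 June (2 to 9 articles); a production system would likely smooth over several days to avoid reacting to single-day dips.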
Structure
Motivation, Problems, and Goals
Topic/S Workflow
– Overview
– Pre-Processing
– Semantic Model, Facts, and Storage
– Post-Processing
– Search and User Interface
Demo
Current and Upcoming Task
Conclusion
Friday, 14.06.2013 Topic/S Slide 35
Workflow: Related Article
Friday, 14.06.2013 Topic/S
Related Article
• Person
• Location
• Organisation
• Keywords
Slide 36
Workflow: Related Article
Friday, 14.06.2013 Topic/S
Related Article - relatedness
• computes topic-based difference between
articles
• Detecting main entities in articles
• navigation recommendation for user
Slide 37
Workflow: Related Article
Friday, 14.06.2013 Topic/S
Related Article - relatedness
[Figure: example articles with their entities (Bernd Lucke, Berlin, AfD, Klaus Wowereit), each annotated with an occurrence count of the form "a + b", e.g. "Occurrence: 1 + 4".]
Slide 38
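A sketch of an entity-based relatedness score between articles. It assumes that the "a + b" occurrence notation refers to counts in two text parts (for example headline and body) and that the first part deserves a higher weight; neither is stated on the slides. Cosine similarity is used here as a stand-in for the unspecified "topic-based difference", and the entity-to-count assignments are illustrative only.

```python
import math


def entity_vector(occurrences: dict, first_weight: float = 2.0) -> dict:
    """Collapse (first, second) occurrence pairs into one weighted score."""
    return {e: first_weight * a + b for e, (a, b) in occurrences.items()}


def relatedness(x: dict, y: dict) -> float:
    """Cosine similarity of two entity vectors (1.0 = same entity profile)."""
    dot = sum(x[e] * y[e] for e in x.keys() & y.keys())
    norm = math.sqrt(sum(v * v for v in x.values())) * math.sqrt(
        sum(v * v for v in y.values())
    )
    return dot / norm if norm else 0.0


def main_entities(vector: dict, top_n: int = 2) -> list:
    """Entities with the highest weighted occurrence, i.e. the article's focus."""
    return sorted(vector, key=vector.get, reverse=True)[:top_n]


# Illustrative articles; counts are not a reconstruction of the slide figure.
article_a = entity_vector({"Bernd Lucke": (1, 4), "Berlin": (0, 1)})
article_b = entity_vector({"Bernd Lucke": (0, 4), "AfD": (0, 5), "Berlin": (1, 3)})
article_c = entity_vector({"Klaus Wowereit": (1, 4)})
```

Ranking candidate articles by `relatedness` to the one currently open would give the navigation recommendation mentioned on the previous slide.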
Structure
Motivation, Problems, and Goals
Topic/S Workflow
Demo
Current and Upcoming Task
Conclusion
Friday, 14.06.2013 Topic/S Slide 39
Live Demo
Friday, 14.06.2013 Topic/S Slide 40
Structure
Motivation, Problems, and Goals
Topic/S Workflow
Demo
Current and Upcoming Task
User Interfaces
Disambiguation
Conclusion
Friday, 14.06.2013 Topic/S Slide 41
Static User Interface
Friday, 14.06.2013 Topic/S Slide 42
Dynamic User Interface
Friday, 14.06.2013 Topic/S Slide 43
Structure
Motivation, Problems, and Goals
Topic/S Workflow
Demo
Current and Upcoming Task
User Interfaces
Disambiguation
Conclusion
Friday, 14.06.2013 Topic/S Slide 44
Disambiguation
Friday, 14.06.2013 Topic/S Slide 45
Source: fansshare.com
Source: lounge.espdisk.com
Source: de.wikipedia.org
Disambiguation
Problem: not all SemItems available in the LOD
Friday, 14.06.2013 Topic/S
Michael Jackson
Beer
Michael Jackson
Beer
Whiskey
Michael Jackson
Music
King of Pop
Internal Facts
External Facts
(DBpedia, etc.)
Identification of
Entity Cluster
Slide 46
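The Michael Jackson example can be sketched as picking the candidate whose known related items overlap most with the SemItems found in the article. The candidate labels and the overlap scoring are assumptions made for illustration; the slide only indicates that internal and external facts are combined to identify an entity cluster.

```python
# Hypothetical candidate profiles; on the slide these come from internal
# facts and external facts (DBpedia, etc.).
CANDIDATES = {
    "Michael Jackson (writer)": {"Beer", "Whiskey"},
    "Michael Jackson (singer)": {"Music", "King of Pop"},
}


def disambiguate(context_items: set, candidates: dict = CANDIDATES):
    """Pick the candidate whose related items overlap most with the context.

    Returns None when no candidate shares any item with the context.
    """
    scored = {name: len(related & context_items) for name, related in candidates.items()}
    best = max(scored, key=scored.get)
    return best if scored[best] > 0 else None
```

Returning `None` for an empty overlap reflects the problem stated on the slide: not every SemItem is available in the LOD, so some mentions cannot be resolved from the known facts.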
Structure
Motivation, Problems, and Goals
Topic/S Workflow
Demo
Current and Upcoming Task
Conclusion
Friday, 14.06.2013 Topic/S Slide 47
Sum it up!
Result
Identifying topics and pushing them
to the editor
Lessons learned
NER: bad for non-English,
combination required
model needs to be optimized for
queries
dedicated user interface required
Outlook
prediction of topics with
causal/temporal relations
Friday, 14.06.2013 Topic/S Slide 48
Source: ooltapulta.com
Source: business-strategy-innovation.com
Sächsische AufbauBank
Research and Development - Project Funding
Project number - 99457/2677
Thanks! Questions?

Towards Topics-based, Semantics-assisted News Search | WIMS13

  • 1. Sächsische AufbauBank, Research and Development Project Funding, project number 99457/2677. Martin Voigt, Michael Aleythe, Peter Wehner
  • 2. Structure Motivation, Problems, and Goals Topic/S Workflow Demo Current and Upcoming Task Conclusion Friday, 14.06.2013 Topic/S Slide 1
  • 4. Motivation. fink & PARTNER Media Services GmbH: media management for publishing houses; some customers. Chair of Multimedia Technology, TU Dresden: research fields are adaptive, composite Rich Internet Applications and semantic document life cycle management.
  • 6. Problem. Overwhelming amount of data, e.g., Mainpost: 2000 articles/day from agencies and in-house production. Sources: news agencies (DPA, Reuters, KNA); web, social media (Twitter, Facebook, Blogs, …); …; in-house production (Archive, Online).
  • 8. Problem. Topics are hard to identify (browsing, keyword identification), and so are their relations, media, and trends. Source: Zeit.de
  • 9. Vision. Automatic topic discovery using Named Entities and other keywords (Semantic Items, SemItem); investigation of trending topics; push them to the editor. [Figure: Media Assets (MA1, MA2) and Named Entities (E1 to E7) after Pre-Processing; the same graph with Topics (T1 to T3) added after Post-Processing.]
  • 10. Requirements. Extraction and disambiguation of (German) SemItems; model and storage of semantic information; topic and trend discovery; scalable architecture for business use case.
  • 11. Structure: Motivation, Problems, and Goals | Topic/S Workflow (Overview; Pre-Processing; Semantic Model, Facts, and Storage; Post-Processing; Search and User Interface) | Demo | Current and Upcoming Task | Conclusion
  • 14. Workflow: Preprocessor. Language Recognition
    Based on article content; supports German and English
    Rule-based solution:
    – Words with capital letter (en 18% vs. de 43%)
    – Occurrence of umlauts (ä, ö, ü)
    – Existence of language-specific words
      • en: of, to, and, a, for, the, that
      • de: der, das, und, sich, auf
    Precision: 99%
    Source: onelanguageoneposter.com
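The three rules on this slide can be sketched as a simple voting scheme. This is only an illustration under assumptions, not the project's implementation: the function name `guess_language`, the one-vote-per-rule scoring, and the 0.30 cut-off (placed between the reported 18% and 43% capitalisation rates) are hypothetical; only the signals and the word lists come from the slide.

```python
import re

# Word lists as given on the slide.
EN_WORDS = {"of", "to", "and", "a", "for", "the", "that"}
DE_WORDS = {"der", "das", "und", "sich", "auf"}

# Assumed cut-off between the reported rates (en 18%, de 43%).
CAPITAL_RATIO_THRESHOLD = 0.30


def guess_language(text):
    """Return "de" or "en" from three rule-based votes (hypothetical scoring)."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of Unicode letters
    if not words:
        return "en"  # arbitrary default for empty input

    de_votes = 0
    en_votes = 0

    # Rule 1: share of words that start with a capital letter.
    capital_ratio = sum(w[0].isupper() for w in words) / len(words)
    if capital_ratio >= CAPITAL_RATIO_THRESHOLD:
        de_votes += 1
    else:
        en_votes += 1

    # Rule 2: umlauts only count as evidence for German.
    if re.search(r"[äöüÄÖÜ]", text):
        de_votes += 1

    # Rule 3: language-specific words.
    lowered = [w.lower() for w in words]
    de_hits = sum(w in DE_WORDS for w in lowered)
    en_hits = sum(w in EN_WORDS for w in lowered)
    if de_hits > en_hits:
        de_votes += 1
    elif en_hits > de_hits:
        en_votes += 1

    # Ties fall back to English in this sketch.
    return "de" if de_votes > en_votes else "en"
```

A production version would need thresholds tuned on real articles; the 99% precision on the slide refers to the project's own solution, not to this sketch.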
  • 15. Workflow: Preprocessor. Keywords: lemmatization; developing a word list; extraction using the word list; bonus: frequent terms of an article. Source: hugdaily.org
  • 16. Workflow: Preprocessor. Categorisation: classification of text; one categoriser per news agency; IPTC categories; categories useful for identifying topics.
  • 17. Workflow: Preprocessor. Categorisation, training. [Figure labels: Politics, Article, IPTC Media Topic, Categoriser.]
  • 18. Workflow: Preprocessor. Categorisation, training. [Figure labels: Politics, Article, IPTC Media Topic, with one Categoriser each for OTS, Reuters, and DPA.]
  • 19. Workflow: Preprocessor. Categorisation. [Figure labels: Politics, Article (DPA), IPTC Media Topic, Categoriser OTS, Categoriser DPA, Categoriser Reuters.]
  • 20. Workflow: Preprocessor. Categorisation, quality
    Accuracy per news agency:
    – KNA: 80.3%
    – DPA: 94.4%
    – EPD: 80.3%
    – Reuters: 90.8%
    – OTS: 93.5%
    – AFP: 86%
    Accuracy per method:
    – One categoriser for all agencies: 85%
    – One categoriser per agency: 87.5%
  • 21. Workflow: Preprocessor. Named Entity Recognition: recognition of persons, organizations, places; two methods: word list, statistics; additional information: – occurrence count – text part the NE appeared in. Source: churchthought.com
  • 22. Workflow: Preprocessor. Named Entity Recognition, approaches
    Word list: tool: LingPipe + extension; sources: LOD (DBpedia, GeoNames, YAGO2); advantages: controlled vocabulary, guaranteed recognition of entities
    Statistics: tool: Stanford NLP; source: pre-trained model; advantage: recognition of unknown entities
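The word-list approach, together with the additional information named on the previous slide (occurrence count, text part), can be illustrated in plain Python. This is a sketch under assumptions and does not use or reproduce the LingPipe API: `WORD_LIST`, `find_entities`, and the part names "title" and "body" are hypothetical, and the three entries stand in for lists built from LOD sources.

```python
import re

# Hypothetical miniature word list; the project derived its lists from LOD sources.
WORD_LIST = {
    "Angela Merkel": "person",
    "Audi": "organization",
    "Berlin": "place",
}


def find_entities(parts):
    """Match word-list entries in each text part and count occurrences.

    `parts` maps a part name (e.g. "title", "body") to its text.
    Returns {name: {"type": ..., "count": total, "parts": {part: count}}}.
    """
    found = {}
    for name, kind in WORD_LIST.items():
        # Word boundaries keep "Audi" from matching inside longer words.
        pattern = re.compile(r"\b" + re.escape(name) + r"\b")
        for part, text in parts.items():
            count = len(pattern.findall(text))
            if count:
                entry = found.setdefault(
                    name, {"type": kind, "count": 0, "parts": {}}
                )
                entry["count"] += count
                entry["parts"][part] = count
    return found
```

Exact matching like this only finds names that are in the list, which is why the slide pairs it with a statistical recogniser for unknown entities.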
  • 24. Semantic Model. Requirements: information life cycle | simple | fast querying | schema reuse | inference | ... Foundations: SNaP Ontologies, IPTC NewsCodes, W3C Ontology for Media Resources, schema.org; RDFS, less OWL; conventions, versioning, and documentation.
  • 26. Storage of Semantic Data. Benchmark of triple stores [Voigt2012]: no benchmark found with real-world data, inference, SPARQL 1.1, and multi-client. What have we done? 4 datasets, 5 stores, 15 queries per dataset; loading time, memory requirement, per-query-type and multi-client performance. Result: no clear recommendation, strongly depends on project requirements.
  • 27. Storage of Semantic Data: using Oracle 11gR2. Pros: already available, existing knowledge; nearly as fast as Virtuoso etc.; integrated querying of relational and semantic data; spatial data mining features. Cons: inference; incomplete SPARQL 1.1 support; limited custom rule support.
  • 28. Semantic Facts. Named entities required but no lists available. Manual search, extraction, and cleaning of named entities from YAGO2, Freebase, JRC_Names, Tagesspiegel, DBpedia. Stored preferred and alternative names.
    ID: http://www.topic-s.de/topics-facts/id/person/Rene_Muller
    Names: Rene Muller, Rene Müller, René Muller, René Müller
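Storing several names per ID lets a recogniser map any spelling found in an article back to one entity. A minimal sketch of such a lookup, assuming an in-memory dictionary rather than the project's triple store; `FACTS`, `build_name_index`, and `lookup` are hypothetical names, and the case-insensitive matching is an added assumption.

```python
# The ID and names are the example from the slide.
FACTS = {
    "http://www.topic-s.de/topics-facts/id/person/Rene_Muller": [
        "Rene Muller",
        "Rene Müller",
        "René Muller",
        "René Müller",
    ],
}


def build_name_index(facts):
    """Map every known name (case-folded) to the set of IDs that carry it."""
    index = {}
    for entity_id, names in facts.items():
        for name in names:
            index.setdefault(name.casefold(), set()).add(entity_id)
    return index


def lookup(index, surface_form):
    """Return all candidate IDs for a name as it appears in the text."""
    return index.get(surface_form.casefold(), set())
```

The index maps to a set because two different entities can share a name; choosing among several candidates is the disambiguation problem listed as an upcoming task later in the talk.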
  • 29. Semantic Facts. But named entities alone cause bad topics; keywords are required, e.g., Waffenstillstand (cease-fire), Meister (champion), Klimaschutz (climate protection), …
    Some numbers:
    – Triples without SemItems: 10.3 million
    SemItem | Number (with alt. names)
    Person | 590,828 (860,594)
    Organization | 63,262 (98,052)
    Place | 89,672 (95,146)
    Keyword | 1,329
  • 31. Workflow: Postprocessor. Clustering
  • 33. Workflow: Postprocessor. Clustering. [Figure labels: Merkel, Politics, Highway, Traffic, Audi, Obama.]
  • 34. Workflow: Postprocessor. Clustering (top clusters, 06.06.2013)
    Articles | First date | Name | Hot topic
    7 | 6.6. | "Bürgermeister", "Gemeinde", "Gemeinderat", "Kosten" | No
    4 | 6.6. | "Abzug", "Bürgerkrieg", "Grenze", "Soldat", "Österreich", "Syrien", "Tel Aviv", "Vereinten Nationen" | Yes
    3 | 6.6. | "Vertrag", "Vorstandschef", "München", "FC Bayern München", "FC Bayern München AG", "Olympique Marseille", "Daniel Van Buyten", "Franck Ribery", "Karl-Heinz Rummenigge" | Yes
    2 | 4.6. | "Ministerpräsident", "Protest", "Istanbul", "Tunis", "Recep Tayyip Erdogan" | Yes
  • 35. Workflow: Postprocessor. Topic trend
    Date | Articles | SemItems
    4.6. | 6 | "Demonstrant", "Ministerpräsident", "Protest", "Regierung", "Stadtteil", "Istanbul", "Recep Tayyip Erdogan"
    5.6. | 14 | "Demonstrant", "Protest", "Istanbul", "Recep Tayyip Erdogan"
    6.6. | 2 | "Ministerpräsident", "Protest", "Istanbul", "Tunis", "Recep Tayyip Erdogan"
    7.6. | 9 | "Demonstrant", "Protest", "Recep Tayyip Erdogan"
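The slides do not say how clusters from different days are recognised as the same topic. One plausible way, shown here purely as an assumption, is to link a day's cluster to the previous one when their SemItem sets overlap enough (Jaccard similarity) and to read the trend from the resulting article counts. `DAILY_CLUSTERS` holds the values from the table; `jaccard`, `trend_series`, and the 0.3 threshold are hypothetical.

```python
# (date, article count, SemItems) as listed on the slide.
DAILY_CLUSTERS = [
    ("4.6.", 6, {"Demonstrant", "Ministerpräsident", "Protest", "Regierung",
                 "Stadtteil", "Istanbul", "Recep Tayyip Erdogan"}),
    ("5.6.", 14, {"Demonstrant", "Protest", "Istanbul", "Recep Tayyip Erdogan"}),
    ("6.6.", 2, {"Ministerpräsident", "Protest", "Istanbul", "Tunis",
                 "Recep Tayyip Erdogan"}),
    ("7.6.", 9, {"Demonstrant", "Protest", "Recep Tayyip Erdogan"}),
]


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def trend_series(clusters, threshold=0.3):
    """Chain clusters whose SemItems overlap with the last linked cluster.

    Returns the (date, article count) pairs that belong to the chain.
    The threshold is an assumed value, not one reported by the authors.
    """
    series = []
    previous = None
    for date, count, items in clusters:
        if previous is None or jaccard(previous, items) >= threshold:
            series.append((date, count))
            previous = items
    return series
```

With the table's data the day-to-day overlaps are 4/7, 3/6, and 2/6, so a threshold of 0.3 keeps all four days in one chain while 0.4 would drop the last day. That sensitivity suggests any real threshold would have to be tuned on data.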
  • 37. Workflow: Related Article. Related articles by: Person, Location, Organisation, Keywords.
  • 38. Workflow: Related Article. Relatedness: computes the topic-based difference between articles; detects the main entities in articles; provides navigation recommendations for the user.
  • 39. Workflow: Related Article. Relatedness. [Figure: example articles annotated with entity occurrence counts. Entities: Bernd Lucke, Berlin, AfD, Klaus Wowereit. Counts in order of appearance: 1 + 4, 0 + 1, 0 + 4, 0 + 5, 1 + 3, 1 + 4.]
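How relatedness is computed is not spelled out on the slides. The sketch below is one possible reading, with every detail an assumption: each "x + y" count is taken to mean title occurrences plus body occurrences, title hits are weighted double, and relatedness is the cosine similarity of the weighted entity vectors. The pairing of counts to entities in `ARTICLES` is illustrative, and the names `entity_weights`, `main_entity`, and `relatedness` are hypothetical.

```python
import math

# Hypothetical (title count, body count) pairs per entity and article.
ARTICLES = {
    "A": {"Bernd Lucke": (1, 4), "Berlin": (0, 1)},
    "B": {"Bernd Lucke": (0, 4), "AfD": (0, 5)},
    "C": {"Berlin": (1, 3), "Klaus Wowereit": (1, 4)},
}


def entity_weights(occurrences, title_weight=2.0):
    """Turn occurrence counts into one weight per entity (assumed weighting)."""
    return {
        entity: title_weight * title + body
        for entity, (title, body) in occurrences.items()
    }


def main_entity(occurrences):
    """Return the entity with the highest weight in an article."""
    weights = entity_weights(occurrences)
    return max(weights, key=weights.get)


def relatedness(a, b):
    """Cosine similarity of two articles' entity weight vectors (0.0 to 1.0)."""
    wa, wb = entity_weights(a), entity_weights(b)
    dot = sum(wa[e] * wb[e] for e in wa.keys() & wb.keys())
    norm = math.sqrt(sum(v * v for v in wa.values())) * math.sqrt(
        sum(v * v for v in wb.values())
    )
    return dot / norm if norm else 0.0
```

Under these assumptions, articles A and B come out as more related than A and C because they share the heavily weighted entity, and B and C score zero because they share none. A difference measure, as named on the previous slide, would be one minus this value.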
  • 41. Live Demo
  • 42. Structure: Motivation, Problems, and Goals | Topic/S Workflow | Demo | Current and Upcoming Task (User Interfaces; Disambiguation) | Conclusion
  • 43. Static User Interface
  • 44. Dynamic User Interface
  • 46. Disambiguation. Sources: fansshare.com, lounge.espdisk.com, de.wikipedia.org
  • 47. Disambiguation. Problem: not all SemItems are available in the LOD. [Figure: Internal Facts (Michael Jackson, Beer); External Facts from DBpedia etc. (Michael Jackson: Beer, Whiskey; Michael Jackson: Music, King of Pop); Identification of Entity Cluster.]
  • 49. Sum it up! Result: identifying topics and pushing them to the editor. Lessons learned: NER is bad for non-English text, so a combination of methods is required; the model needs to be optimized for queries; a dedicated user interface is required. Outlook: prediction of topics with causal/temporal relations. Sources: ooltapulta.com, business-strategy-innovation.com
  • 50. Sächsische AufbauBank, Research and Development Project Funding, project number 99457/2677. Thanks! Questions?