Sächsische AufbauBank
Research and Development - Project Funding
Project number: 99457/2677
Martin Voigt, Michael Aleythe, Peter Wehner
Structure
Motivation, Problems, and Goals
Topic/S Workflow
Demo
Current and Upcoming Task
Conclusion
Friday, 14.06.2013 Topic/S Slide 1
Motivation
fink & PARTNER Media Services GmbH
– Media management for publishing houses
– Some customers
Chair of Multimedia Technology, TU Dresden
– Research fields
• Adaptive, composite Rich Internet Applications
• Semantic document life cycle management
Motivation
Newsroom
Source: ringier.com
Problem
Overwhelming amount of data
– e.g., Mainpost: 2000 articles/day from agencies and in-house production
Data sources (diagram labels):
– News agencies: DPA, Reuters, KNA
– Web, social media: Twitter, Facebook, Blogs, …
– …
– In-house production
– Archive
– Online
Problem
Hard to identify topics
– Browsing
– Keyword identification
And their relations, media, and trends
Source: Zeit.de
Vision
Automatic topic discovery using Named Entities and other keywords (Semantic Items, SemItem)
Investigation of trending topics
Push them to the editor
Figure: Pre-Processing links Media Assets (MA1, MA2) to Named Entities (E1 to E7); Post-Processing then groups the Named Entities into Topics (T1 to T3).
Requirements
– Extraction and disambiguation of (German) SemItems
– Model and storage of semantic information
– Topic and trend discovery
– Scalable architecture for business use case
Structure
Motivation, Problems, and Goals
Topic/S Workflow
– Overview
– Pre-Processing
– Semantic Model, Facts, and Storage
– Post-Processing
– Search and User Interface
Demo
Current and Upcoming Task
Conclusion
Workflow
Workflow: Preprocessor
Language Recognition
Based on article content
Supports German and English
Rule-based solution:
– Words with a capital letter (en 18% vs. de 43%)
– Occurrence of umlauts (ä, ö, ü)
– Presence of language-specific words
• en: of, to, and, a, for, the, that
• de: der, das, und, sich, auf
Precision: 99%
Source: onelanguageoneposter.com
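The three rules on this slide can be combined into a small scoring function. The following is a minimal sketch, not the project's implementation: the marker words come from the slide, while the 0.30 capitalisation threshold, the scoring scheme, and the name `detect_language` are assumptions made for illustration.

```python
import re

# Marker words taken from the slide; thresholds and scoring are assumptions.
EN_WORDS = {"of", "to", "and", "a", "for", "the", "that"}
DE_WORDS = {"der", "das", "und", "sich", "auf"}


def detect_language(text: str) -> str:
    """Return 'de' or 'en' using three hand-written rules."""
    words = re.findall(r"[^\W\d_]+", text)
    if not words:
        return "en"  # arbitrary default for empty input (assumption)

    score = 0  # positive favours German, negative favours English

    # Rule 1: share of capitalised words (slide: en 18% vs. de 43%).
    capital_share = sum(w[0].isupper() for w in words) / len(words)
    if capital_share > 0.30:  # hypothetical threshold between 18% and 43%
        score += 1
    else:
        score -= 1

    # Rule 2: umlauts point to German.
    if re.search(r"[äöüÄÖÜ]", text):
        score += 1

    # Rule 3: language-specific function words.
    lowered = [w.lower() for w in words]
    score += sum(w in DE_WORDS for w in lowered)
    score -= sum(w in EN_WORDS for w in lowered)

    return "de" if score > 0 else "en"
```

Short texts carry few signals, so a rule set like this is likely to be most reliable on full article bodies.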
Workflow: Preprocessor
Keywords
Lemmatization
Developing a word list
Extraction using the word list
Bonus: frequent terms of an article
Source: hugdaily.org
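The keyword step can be pictured as lemmatize, look up in the word list, and count. This is a sketch under stated assumptions: the tiny `LEMMAS` table and `KEYWORD_LIST` are hypothetical stand-ins for a real lemmatizer and the project's curated word list, and `extract_keywords` is an illustrative name.

```python
import re
from collections import Counter

# Hypothetical lemma table and keyword list; a real system would use a full lemmatizer.
LEMMAS = {"soldaten": "soldat", "grenzen": "grenze", "proteste": "protest"}
KEYWORD_LIST = {"soldat", "grenze", "protest", "waffenstillstand"}


def lemmatize(token: str) -> str:
    """Map an inflected form to its base form; unknown tokens pass through lowercased."""
    token = token.lower()
    return LEMMAS.get(token, token)


def extract_keywords(text: str, top_n: int = 3):
    """Return (keywords found via the word list, most frequent lemmas)."""
    lemmas = [lemmatize(t) for t in re.findall(r"[^\W\d_]+", text)]
    counts = Counter(lemmas)
    keywords = sorted(k for k in counts if k in KEYWORD_LIST)
    frequent = [term for term, _ in counts.most_common(top_n)]
    return keywords, frequent
```

The frequent-term list is unfiltered here, so stop words would show up in it unless they are removed first.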
Workflow: Preprocessor
Categorisation
Classification of text
One categoriser per news agency
IPTC categories
Categories are useful for identifying topics
Workflow: Preprocessor
Categorisation - Training
Figure: articles labelled with their IPTC Media Topic (e.g., Politics) are used to train the categoriser; a separate categoriser is trained for each agency (OTS, Reuters, DPA).
Workflow: Preprocessor
Categorisation
Figure: a DPA article is passed to the DPA categoriser (not to the OTS or Reuters categoriser), which assigns the IPTC Media Topic (e.g., Politics).
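The "one categoriser per agency" idea amounts to a lookup by agency before classification. The sketch below only illustrates that routing: the cue-word categorisers are hypothetical stand-ins for trained models, and the category strings and function names are placeholders rather than the project's code.

```python
from typing import Callable, Dict

Categoriser = Callable[[str], str]


def make_keyword_categoriser(rules: Dict[str, str], default: str) -> Categoriser:
    """Build a toy categoriser: the first matching cue word decides the category."""
    def categorise(text: str) -> str:
        lowered = text.lower()
        for cue, category in rules.items():
            if cue in lowered:
                return category
        return default
    return categorise


# One categoriser per agency; cue words are hypothetical stand-ins for trained models.
CATEGORISERS: Dict[str, Categoriser] = {
    "DPA": make_keyword_categoriser({"bundestag": "politics"}, "other"),
    "Reuters": make_keyword_categoriser({"parliament": "politics"}, "other"),
    "OTS": make_keyword_categoriser({"pressemitteilung": "economy"}, "other"),
}


def categorise_article(agency: str, text: str) -> str:
    """Route the article to the categoriser trained for its agency."""
    try:
        categoriser = CATEGORISERS[agency]
    except KeyError:
        raise ValueError(f"no categoriser for agency {agency!r}") from None
    return categoriser(text)
```

Keeping the models in a dictionary keyed by agency means adding an agency needs only a new entry, at the cost of training and maintaining one model per source.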
Workflow: Preprocessor
Categorisation - Quality
News agency   Accuracy
KNA           80.3 %
DPA           94.4 %
EPD           80.3 %
Reuters       90.8 %
OTS           93.5 %
AFP           86 %
Method                             Accuracy
One categoriser for all agencies   85 %
One categoriser per agency         87.5 %
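As a quick arithmetic check, the unweighted mean of the six per-agency accuracies comes out close to the per-agency method figure. Whether the authors computed their figure as a simple average is an assumption; the snippet only shows that the numbers are consistent with it.

```python
from statistics import mean

# Per-agency accuracies in percent, copied from the slide.
ACCURACY = {
    "KNA": 80.3, "DPA": 94.4, "EPD": 80.3,
    "Reuters": 90.8, "OTS": 93.5, "AFP": 86.0,
}

# Unweighted mean over agencies (assumes every agency counts equally).
per_agency_mean = mean(ACCURACY.values())

# Difference to the single-categoriser figure of 85 % from the slide.
gain_over_single = per_agency_mean - 85.0
```

The mean is about 87.55 %, roughly 2.5 percentage points above the single-categoriser figure.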
Workflow: Preprocessor
Named Entity Recognition
Recognition of persons, organizations, places
Two methods: word list, statistics
Additional information:
– occurrence count
– part of the text in which the NE appeared
Source: churchthought.com
Workflow: Preprocessor
Named Entity Recognition – Approaches
Word list
– Tool: LingPipe + extension
– Sources: LOD (DBpedia, GeoNames, YAGO2)
– Advantages: controlled vocabulary, guaranteed recognition of entities
Statistics
– Tool: Stanford NLP
– Source: pre-trained model
– Advantage: recognition of unknown entities
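The word-list approach, together with the occurrence count and text part mentioned on the previous slide, can be sketched with plain string matching. This does not use LingPipe or Stanford NLP; the `GAZETTEER` entries and the `recognise_entities` function are hypothetical, and the statistical path is not shown.

```python
import re
from collections import defaultdict

# Hypothetical word list; the slides build theirs from LOD sources.
GAZETTEER = {
    "Recep Tayyip Erdogan": "person",
    "Istanbul": "place",
    "FC Bayern München": "organization",
}


def recognise_entities(parts: dict) -> dict:
    """Match word-list entries in each text part (e.g. 'title', 'body').

    Returns {entity: {"type": ..., "count": ..., "parts": [...]}}.
    """
    found = defaultdict(lambda: {"type": None, "count": 0, "parts": []})
    for part_name, text in parts.items():
        for name, entity_type in GAZETTEER.items():
            # Whole-phrase match so a name does not match inside a longer word.
            pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
            hits = len(re.findall(pattern, text))
            if hits:
                entry = found[name]
                entry["type"] = entity_type
                entry["count"] += hits
                entry["parts"].append(part_name)
    return dict(found)
```

A word list of this kind only finds names it already contains, which is why the slide pairs it with a statistical recogniser for unknown entities.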
Semantic Model
Requirements
– information life cycle | simple | fast querying | schema reuse | inference | ...
Foundations
– SNaP Ontologies, IPTC NewsCodes, W3C Ontology for Media Resources, schema.org
– RDFS, less OWL
Conventions, versioning, and documentation
Storage of Semantic Data
Benchmark of triple stores [Voigt2012]
– No existing benchmark found that covers real-world data, inference, SPARQL 1.1, and multiple clients
What have we done?
– 4 datasets, 5 stores, 15 queries per dataset
– Loading time, memory requirements, per-query-type and multi-client performance
Result
– No clear recommendation; the choice strongly depends on project requirements
Storage of Semantic Data
Using Oracle 11gR2
Pros
– Already available, existing knowledge
– Nearly as fast as Virtuoso etc.
– Integrated querying of relational and semantic data
– Spatial data mining features
Cons
– Inference
– Incomplete SPARQL 1.1 support
– Limited custom rule support
Semantic Facts
Named entities are required, but no lists were available
Manual search, extraction, and cleaning of named entities from YAGO2, Freebase, JRC_Names, Tagesspiegel, DBpedia
Preferred and alternative names are stored, e.g.:
– ID: http://www.topic-s.de/topics-facts/id/person/Rene_Muller
– Names: Rene Muller, Rene Müller, René Muller, René Müller
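Storing preferred and alternative names makes it possible to map every spelling variant to one entity ID. The sketch below uses the ID and names from the slide; the in-memory dictionary, the Unicode normalisation step, and the `resolve` function are assumptions for illustration, since the project keeps these facts in a triple store.

```python
import unicodedata

# ID and names copied from the slide.
FACTS = {
    "http://www.topic-s.de/topics-facts/id/person/Rene_Muller": [
        "Rene Muller", "Rene Müller", "René Muller", "René Müller",
    ],
}

# Index every stored name (NFC-normalised) to its entity ID.
NAME_INDEX = {
    unicodedata.normalize("NFC", name): entity_id
    for entity_id, names in FACTS.items()
    for name in names
}


def resolve(name: str):
    """Return the entity ID for a preferred or alternative name, else None."""
    return NAME_INDEX.get(unicodedata.normalize("NFC", name))
```

Normalising to NFC means that a name typed with combining accents resolves to the same entry as its precomposed form.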
Semantic Facts
BUT named entities alone lead to bad topics: keywords are required as well, e.g., Waffenstillstand (cease-fire), Meister (champion), Klimaschutz (climate protection), …
Some numbers
Triples without SemItems: 10.3 million
SemItem        Number    (with alt. names)
Person         590,828   (860,594)
Organization   63,262    (98,052)
Place          89,672    (95,146)
Keyword        1,329
Workflow: Postprocessor
Clustering
Figure: example SemItems to be clustered: Merkel, Politics, Highway, Traffic, Audi, Obama
Workflow: Postprocessor
Clustering (Top Cluster 06.06.2013)
Articles | First date | Name | Hot topic
7 | 6.6. | "Bürgermeister", "Gemeinde", "Gemeinderat", "Kosten" | No
4 | 6.6. | "Abzug", "Bürgerkrieg", "Grenze", "Soldat", "Österreich", "Syrien", "Tel Aviv", "Vereinten Nationen" | Yes
3 | 6.6. | "Vertrag", "Vorstandschef", "München", "FC Bayern München", "FC Bayern München AG", "Olympique Marseille", "Daniel Van Buyten", "Franck Ribery", "Karl-Heinz Rummenigge" | Yes
2 | 4.6. | "Ministerpräsident", "Protest", "Istanbul", "Tunis", "Recep Tayyip Erdogan" | Yes
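The slides do not name the clustering algorithm, so the following is only one plausible way to arrive at clusters like those in the table: a greedy single-pass grouping of articles by the Jaccard overlap of their SemItem sets. The threshold of 0.5 and all function names are assumptions.

```python
def jaccard(a: set, b: set) -> float:
    """Share of SemItems that two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_articles(articles: dict, threshold: float = 0.5) -> list:
    """Greedy single-pass clustering: join the first cluster that is similar enough.

    articles maps article id -> set of SemItems.
    Each cluster is {"articles": [...], "items": set of all SemItems seen}.
    """
    clusters = []
    for article_id, items in articles.items():
        for cluster in clusters:
            if jaccard(items, cluster["items"]) >= threshold:
                cluster["articles"].append(article_id)
                cluster["items"] |= items
                break
        else:
            # No existing cluster was similar enough: start a new one.
            clusters.append({"articles": [article_id], "items": set(items)})
    # Largest clusters first, as in a "top cluster" listing.
    return sorted(clusters, key=lambda c: len(c["articles"]), reverse=True)
```

A single-pass scheme depends on article order; it is shown here for brevity, not because the project is known to use it.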
Workflow: Postprocessor
Topic trend
Date | Articles | SemItems
4.6. | 6 | "Demonstrant", "Ministerpräsident", "Protest", "Regierung", "Stadtteil", "Istanbul", "Recep Tayyip Erdogan"
5.6. | 14 | "Demonstrant", "Protest", "Istanbul", "Recep Tayyip Erdogan"
6.6. | 2 | "Ministerpräsident", "Protest", "Istanbul", "Tunis", "Recep Tayyip Erdogan"
7.6. | 9 | "Demonstrant", "Protest", "Recep Tayyip Erdogan"
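A trend for this topic can be read off the daily article counts. The sketch below computes day-over-day changes from the table's numbers; how the project actually decides that a topic is trending is not stated in the slides, so the "rising day" rule is an assumption.

```python
# (date, article count) pairs copied from the slide's table.
COUNTS = [("4.6.", 6), ("5.6.", 14), ("6.6.", 2), ("7.6.", 9)]


def daily_changes(counts):
    """Return (date, count, change versus the previous day) for each day after the first."""
    return [
        (date, count, count - prev_count)
        for (_, prev_count), (date, count) in zip(counts, counts[1:])
    ]


def rising_days(counts):
    """Dates on which the topic had more articles than the day before."""
    return [date for date, _, change in daily_changes(counts) if change > 0]
```

For the table above, the changes are +8, -12, and +7, so the topic rises on 5.6. and 7.6.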
Workflow: Related Article
Related Article
• Person
• Location
• Organisation
• Keywords
Workflow: Related Article
Related Article - relatedness
• Computes the topic-based difference between articles
• Detects the main entities in articles
• Provides navigation recommendations for the user
Workflow: Related Article
Related Article - relatedness
Figure: two articles compared by their entities and occurrence counts. Entities shown: Bernd Lucke, Berlin, AfD, Klaus Wowereit. Occurrence values shown (each as a sum of two counts): 1 + 4, 0 + 1, 0 + 4, 0 + 5, 1 + 3, 1 + 4.
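One way to turn entity occurrence counts into a relatedness score is a weighted Jaccard similarity. This is a sketch, not the project's formula: it assumes that each "a + b" value is a pair of counts from two text parts that can simply be summed, and the pairing of entities with counts in the usage example is illustrative only.

```python
def total(occurrence) -> int:
    """Collapse an (a, b) occurrence pair into one count."""
    return sum(occurrence)


def relatedness(article_a: dict, article_b: dict) -> float:
    """Weighted Jaccard similarity over entity occurrence counts (0.0 to 1.0)."""
    shared = 0
    combined = 0
    for entity in set(article_a) | set(article_b):
        count_a = total(article_a.get(entity, (0, 0)))
        count_b = total(article_b.get(entity, (0, 0)))
        shared += min(count_a, count_b)
        combined += max(count_a, count_b)
    return shared / combined if combined else 0.0


def main_entity(article: dict) -> str:
    """The entity with the highest total occurrence count."""
    return max(article, key=lambda entity: total(article[entity]))
```

The topic-based difference mentioned on the previous slide would then be `1 - relatedness(a, b)` under this definition.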
Live Demo
Structure
Motivation, Problems, and Goals
Topic/S Workflow
Demo
Current and Upcoming Task
User Interfaces
Disambiguation
Conclusion
Static User Interface
Dynamic User Interface
Disambiguation
Sources: fansshare.com, lounge.espdisk.com, de.wikipedia.org
Disambiguation
Problem: not all SemItems are available in the LOD
Figure: identification of entity clusters.
– Internal facts: Michael Jackson (Beer)
– External facts (DBpedia, etc.): Michael Jackson (Beer, Whiskey); Michael Jackson (Music, King of Pop)
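The Michael Jackson example suggests choosing between same-named entities by comparing the terms that surround them. The sketch below picks the candidate whose related terms overlap most with the article context; the candidate labels, the overlap measure, and the `disambiguate` function are assumptions, not the project's method.

```python
# Hypothetical candidate clusters, modelled on the slide's Michael Jackson example.
CANDIDATES = {
    "Michael Jackson (beer writer)": {"Beer", "Whiskey"},
    "Michael Jackson (musician)": {"Music", "King of Pop"},
}


def disambiguate(context: set, candidates: dict):
    """Pick the candidate whose related terms overlap most with the article context.

    Returns None when no candidate shares any term with the context.
    """
    best, best_overlap = None, 0
    for candidate, related in candidates.items():
        overlap = len(context & related)
        if overlap > best_overlap:
            best, best_overlap = candidate, overlap
    return best
```

Returning None for an empty overlap leaves room for the case on the slide where a SemItem is not available in the LOD at all.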
Sum it up!
Result
– Identifying topics and pushing them to the editor
Lessons learned
– NER: performs badly for non-English text, a combination of approaches is required
– The model needs to be optimized for queries
– A dedicated user interface is required
Outlook
– Prediction of topics with causal/temporal relations
Sources: ooltapulta.com, business-strategy-innovation.com
Thanks! Questions?

Towards Topics-based, Semantics-assisted News Search | WIMS13

  • 1. Sächsische AufbauBank Forschung und Entwicklung - Projektförderung Projektnummer - 99457/2677 Martin Voigt, Michael Aleythe, Peter Wehner
  • 2. Structure Motivation, Problems, and Goals Topic/S Workflow Demo Current and Upcoming Task Conclusion Friday, 14.06.2013 Topic/S Slide 1
  • 3. Structure Motivation, Problems, and Goals Topic/S Workflow Demo Current and Upcoming Task Conclusion Friday, 14.06.2013 Topic/S Slide 2
  • 4. Motivation fink & PARTNER Media Services GmbH Media management for publishing houses Some customers Chair of Multimedia Technology, TU Dresden Research fields Adaptive, composite Rich Internet Applications Semantic document life cycle management Friday, 14.06.2013 Topic/S Slide 3
  • 6. Problem Overwhelming amount of data e.g., Mainpost 2000 articles/day from agencies and in-house production Friday, 14.06.2013 Topic/S DPA Reuters KNA Twitter Facebook Blogs … News agencies Web, social media … In-house production Archive Online Slide 5
  • 8. Problem: topics are hard to identify (browsing, keyword identification), and so are their relations, media, and trends. Source: Zeit.de
  • 9. Vision: automatic topic discovery using Named Entities and other keywords (Semantic Items, SemItem); investigation of trending topics; push them to the editor. [Figure: after pre-processing, media assets (MA1, MA2) are linked to named entities (E1 to E7); after post-processing, topics (T1 to T3) are added.]
  • 10. Requirements: extraction and disambiguation of (German) SemItems; model and storage of semantic information; topic and trend discovery; scalable architecture for the business use case.
  • 11. Structure: Motivation, Problems, and Goals; Topic/S Workflow (Overview; Pre-Processing; Semantic Model, Facts, and Storage; Post-Processing; Search and User Interface); Demo; Current and Upcoming Task; Conclusion.
  • 14. Workflow: Preprocessor. Language recognition: based on article content; supports German and English. Rule-based solution: share of words with a capital letter (en 18% vs. de 43%); occurrence of umlauts (ä, ö, ü); existence of language-specific words (en: of, to, and, a, for, the, that; de: der, das, und, sich, auf). Precision: 99%. Source: onelanguageoneposter.com
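
The three rules on this slide can be sketched as a simple voting scheme. This is only an illustration: the slide does not say how the rules are combined, so the voting, the threshold of 0.305 (the midpoint between 18% and 43%), the default for empty input, and the function name `detect_language` are all assumptions. Only the marker words and the umlaut rule come from the slide.

```python
import re

# Marker words taken from the slide; scoring and thresholds are assumptions.
EN_WORDS = {"of", "to", "and", "a", "for", "the", "that"}
DE_WORDS = {"der", "das", "und", "sich", "auf"}
UMLAUTS = set("äöüÄÖÜ")


def detect_language(text: str) -> str:
    """Return "de" or "en" using three rule-based votes."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of Unicode letters
    if not words:
        return "en"  # arbitrary default for empty input (assumption)

    votes_de = 0
    votes_en = 0

    # Rule 1: share of capitalised words (slide: en 18% vs. de 43%).
    # The midpoint of the two figures is a hypothetical threshold.
    capital_ratio = sum(w[0].isupper() for w in words) / len(words)
    if capital_ratio > 0.305:
        votes_de += 1
    else:
        votes_en += 1

    # Rule 2: any umlaut counts as one vote for German.
    if any(ch in UMLAUTS for ch in text):
        votes_de += 1

    # Rule 3: language-specific marker words.
    lower = [w.lower() for w in words]
    de_hits = sum(w in DE_WORDS for w in lower)
    en_hits = sum(w in EN_WORDS for w in lower)
    if de_hits > en_hits:
        votes_de += 1
    elif en_hits > de_hits:
        votes_en += 1

    # Ties fall back to English (assumption).
    return "de" if votes_de > votes_en else "en"
```

A real system would probably tune the threshold on labelled articles; the 99% precision quoted on the slide refers to the authors' implementation, not to this sketch.
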
  • 15. Workflow: Preprocessor. Keywords: lemmatization; developing a word list; extraction using the word list; bonus: frequent terms of an article. Source: hugdaily.org
  • 16. Workflow: Preprocessor. Categorisation: classification of text; one categoriser per news agency; IPTC categories; categories are useful for identifying topics.
  • 17. Workflow: Preprocessor. Categorisation, training. [Figure labels: article; IPTC Media Topic (Politics); categoriser.]
  • 18. Workflow: Preprocessor. Categorisation, training. [Figure labels: articles from OTS, Reuters, and DPA, each with an IPTC Media Topic (Politics) and a categoriser of its own.]
  • 19. Workflow: Preprocessor. Categorisation. [Figure labels: DPA article; categorisers for OTS, DPA, and Reuters; IPTC Media Topic (Politics).]
  • 20. Workflow: Preprocessor. Categorisation, quality.
      News agency | Accuracy
      KNA | 80.3 %
      DPA | 94.4 %
      EPD | 80.3 %
      Reuters | 90.8 %
      OTS | 93.5 %
      AFP | 86 %
      Method | Accuracy
      One categoriser for all agencies | 85 %
      One categoriser per agency | 87.5 %
  • 21. Workflow: Preprocessor. Named Entity Recognition: recognition of persons, organizations, and places; two methods: word list and statistics; additional information: occurrence count and the text part in which the named entity appeared. Source: churchthought.com
  • 22. Workflow: Preprocessor. Named Entity Recognition, approaches. Word list: tool: LingPipe + extension; sources: LOD (DBpedia, GeoNames, YAGO2); advantages: controlled vocabulary, guaranteed recognition of listed entities. Statistics: tool: Stanford NLP; source: pre-trained model; advantage: recognition of unknown entities.
  • 23. Structure: Motivation, Problems, and Goals; Topic/S Workflow (Overview; Pre-Processing; Semantic Model, Facts, and Storage; Post-Processing; Search and User Interface); Demo; Current and Upcoming Task; Conclusion.
  • 24. Semantic Model. Requirements: information life cycle | simple | fast querying | schema reuse | inference | ... Foundations: SNaP Ontologies, IPTC NewsCodes, W3C Ontology for Media Resources, schema.org; RDFS, less OWL. Conventions, versioning, and documentation.
  • 26. Storage of Semantic Data. Benchmark of triple stores [Voigt2012]: no existing benchmark was found that covers real-world data, inference, SPARQL 1.1, and multi-client use. What have we done? 4 datasets, 5 stores, 15 queries per dataset; measured loading time, memory requirement, per-query-type performance, and multi-client performance. Result: no clear recommendation; the choice strongly depends on project requirements.
  • 27. Storage of Semantic Data using Oracle 11gR2. Pros: already available, existing knowledge; nearly as fast as Virtuoso etc.; integrated querying of relational and semantic data; spatial data mining features. Cons: inference; incomplete SPARQL 1.1 support; limited custom rule support.
  • 28. Semantic Facts. Named entities are required, but no lists were available. Manual search, extraction, and cleaning of named entities from YAGO2, Freebase, JRC_Names, Tagesspiegel, and DBpedia. Preferred and alternative names are stored. ID: http://www.topic-s.de/topics-facts/id/person/Rene_Muller; names: Rene Muller, Rene Müller, René Muller, René Müller.
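
Storing preferred and alternative names under one ID suggests a lookup from any spelling to the entity. The sketch below shows one way such an alias index could work. The ID and the four names are the example from the slide; the dictionary layout, the normalisation step, and the names `FACTS`, `normalize`, `build_alias_index`, and `lookup` are hypothetical, and the project itself stored its facts in a triple store rather than a Python dictionary.

```python
import unicodedata

# Example record taken from the slide; the dictionary layout is an assumption.
FACTS = {
    "http://www.topic-s.de/topics-facts/id/person/Rene_Muller": [
        "Rene Muller", "Rene Müller", "René Muller", "René Müller",
    ],
}


def normalize(name: str) -> str:
    """Compose Unicode, lower-case, and collapse whitespace (hypothetical)."""
    return " ".join(unicodedata.normalize("NFC", name).lower().split())


def build_alias_index(facts):
    """Map every preferred or alternative name to its entity ID."""
    index = {}
    for entity_id, names in facts.items():
        for name in names:
            index[normalize(name)] = entity_id
    return index


def lookup(index, surface_form):
    """Return the entity ID for a name found in text, or None if unknown."""
    return index.get(normalize(surface_form))


ALIAS_INDEX = build_alias_index(FACTS)
```

With this index, `lookup(ALIAS_INDEX, "René Müller")` and `lookup(ALIAS_INDEX, "Rene Muller")` resolve to the same ID, while a name that is not in the facts returns `None`.
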
  • 29. Semantic Facts. BUT using only named entities leads to bad topics; keywords are required as well, e.g., Waffenstillstand (cease-fire), Meister (champion), Klimaschutz (climate protection), … Some numbers: triples without SemItems: 10.3 million.
      SemItem | Number (with alt. names)
      Person | 590,828 (860,594)
      Organization | 63,262 (98,052)
      Place | 89,672 (95,146)
      Keyword | 1329
  • 30. Structure: Motivation, Problems, and Goals; Topic/S Workflow (Overview; Pre-Processing; Semantic Model, Facts, and Storage; Post-Processing; Search and User Interface); Demo; Current and Upcoming Task; Conclusion.
  • 31. Workflow: Postprocessor. Clustering.
  • 32. Workflow: Postprocessor. Clustering.
  • 33. Workflow: Postprocessor. Clustering. [Figure labels: Merkel, Politics, Highway, Traffic, Audi, Obama.]
  • 34. Workflow: Postprocessor. Clustering (top clusters, 06.06.2013).
      Articles | First date | Name | Hot topic
      7 | 6.6. | "Bürgermeister", "Gemeinde", "Gemeinderat", "Kosten" | No
      4 | 6.6. | "Abzug", "Bürgerkrieg", "Grenze", "Soldat", "Österreich", "Syrien", "Tel Aviv", "Vereinten Nationen" | Yes
      3 | 6.6. | "Vertrag", "Vorstandschef", "München", "FC Bayern München", "FC Bayern München AG", "Olympique Marseille", "Daniel Van Buyten", "Franck Ribery", "Karl-Heinz Rummenigge" | Yes
      2 | 4.6. | "Ministerpräsident", "Protest", "Istanbul", "Tunis", "Recep Tayyip Erdogan" | Yes
  • 35. Workflow: Postprocessor. Topic trend.
      Date | Articles | SemItems
      4.6. | 6 | "Demonstrant", "Ministerpräsident", "Protest", "Regierung", "Stadtteil", "Istanbul", "Recep Tayyip Erdogan"
      5.6. | 14 | "Demonstrant", "Protest", "Istanbul", "Recep Tayyip Erdogan"
      6.6. | 2 | "Ministerpräsident", "Protest", "Istanbul", "Tunis", "Recep Tayyip Erdogan"
      7.6. | 9 | "Demonstrant", "Protest", "Recep Tayyip Erdogan"
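
The trend table follows one topic over several days even though its SemItems change from day to day. The slides do not say how clusters from different days are linked, so the sketch below assumes a plain Jaccard overlap between consecutive days with a hypothetical threshold of 0.3. The rows are the ones from the slide; `jaccard`, `follow_topic`, and `SLIDE_ROWS` are illustrative names.

```python
# Rows from the topic trend slide: (date, article count, SemItems).
SLIDE_ROWS = [
    ("4.6.", 6, {"Demonstrant", "Ministerpräsident", "Protest", "Regierung",
                 "Stadtteil", "Istanbul", "Recep Tayyip Erdogan"}),
    ("5.6.", 14, {"Demonstrant", "Protest", "Istanbul", "Recep Tayyip Erdogan"}),
    ("6.6.", 2, {"Ministerpräsident", "Protest", "Istanbul", "Tunis",
                 "Recep Tayyip Erdogan"}),
    ("7.6.", 9, {"Demonstrant", "Protest", "Recep Tayyip Erdogan"}),
]


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


def follow_topic(daily_clusters, threshold=0.3):
    """Follow one topic through consecutive days.

    A day is kept while its SemItems overlap enough with the previous day's
    (assumed linking rule). Returns a list of (date, article_count) pairs.
    """
    trend = []
    previous = None
    for date, count, items in daily_clusters:
        if previous is not None and jaccard(previous, items) < threshold:
            break  # overlap too small: treat it as a different topic
        trend.append((date, count))
        previous = items
    return trend
```

With the default threshold all four days are linked; raising it to 0.4 drops 7.6., whose overlap with 6.6. is 2 of 6 SemItems. This shows how sensitive such a linking rule is to the chosen cut-off.
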
  • 36. Structure: Motivation, Problems, and Goals; Topic/S Workflow (Overview; Pre-Processing; Semantic Model, Facts, and Storage; Post-Processing; Search and User Interface); Demo; Current and Upcoming Task; Conclusion.
  • 37. Workflow: Related Article. Related articles by: person; location; organisation; keywords.
  • 38. Workflow: Related Article. Relatedness: computes the topic-based difference between articles; detects the main entities in articles; gives navigation recommendations to the user.
  • 39. Workflow: Related Article. Relatedness. [Figure: example articles with the entities Bernd Lucke, AfD, Berlin, and Klaus Wowereit, each annotated with an occurrence count of the form "a + b" (1 + 4, 0 + 1, 0 + 4, 0 + 5, 1 + 3, 1 + 4).]
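
The relatedness slides mention main entities and a topic-based difference between articles without giving a formula. As a rough illustration only, the sketch below treats each article as a bag of entity occurrence counts, picks the most frequent entities as main entities, and scores two articles with a weighted Jaccard similarity. The measure, the function names, and the example counts are assumptions; the counts are not the ones from the figure.

```python
from collections import Counter


def main_entities(counts, k=2):
    """Return the k most frequent entities of an article."""
    return [name for name, _ in Counter(counts).most_common(k)]


def relatedness(a, b):
    """Weighted Jaccard similarity of two entity-count dictionaries.

    1.0 means identical entity profiles, 0.0 means no shared entity.
    """
    keys = set(a) | set(b)
    shared = sum(min(a.get(key, 0), b.get(key, 0)) for key in keys)
    total = sum(max(a.get(key, 0), b.get(key, 0)) for key in keys)
    return shared / total if total else 0.0


# Hypothetical occurrence counts for two articles.
article_a = {"Bernd Lucke": 5, "AfD": 5, "Berlin": 1}
article_b = {"Klaus Wowereit": 4, "Berlin": 5}
```

Here the two articles share only "Berlin", which is a minor entity in the first one, so the score stays low (1/19). Weighting by occurrence is one way to keep incidental mentions from dominating a navigation recommendation.
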
  • 40. Structure: Motivation, Problems, and Goals; Topic/S Workflow; Demo; Current and Upcoming Task; Conclusion.
  • 41. Live Demo.
  • 42. Structure: Motivation, Problems, and Goals; Topic/S Workflow; Demo; Current and Upcoming Task (User Interfaces; Disambiguation); Conclusion.
  • 43. Static User Interface.
  • 44. Dynamic User Interface.
  • 45. Structure: Motivation, Problems, and Goals; Topic/S Workflow; Demo; Current and Upcoming Task (User Interfaces; Disambiguation); Conclusion.
  • 46. Disambiguation. Sources: fansshare.com, lounge.espdisk.com, de.wikipedia.org
  • 47. Disambiguation. Problem: not all SemItems are available in the LOD. [Figure labels: internal facts; external facts (DBpedia, etc.); entity clusters "Michael Jackson, Beer, Whiskey" and "Michael Jackson, Music, King of Pop"; identification of entity cluster.]
  • 48. Structure: Motivation, Problems, and Goals; Topic/S Workflow; Demo; Current and Upcoming Task; Conclusion.
  • 49. Sum it up! Result: identifying topics and pushing them to the editor. Lessons learned: NER works badly for non-English text, so a combination of methods is required; the model needs to be optimized for queries; a dedicated user interface is required. Outlook: prediction of topics with causal/temporal relations. Sources: ooltapulta.com, business-strategy-innovation.com
  • 50. Thanks! Questions?