SlideShare a Scribd company logo
1 of 53
Global Media Monitoring 
http://eventregistry.org/ 
Marko Grobelnik 
Marko.Grobelnik@ijs.si 
Jozef Stefan Institute 
Ljubljana, Slovenia 
Contributions from Gregor Leban, Blaz Fortuna, Janez Brank, Jan Rupnik, Andrej Muhic, Aljaz Kosmerlj 
ESWC Summer School, Sep 2nd 2014, Kalamaki
Outline 
• Introduction 
• Collecting Media Data 
• Document Enrichment 
• Cross-linguality 
• News Reporting Bias 
• Event Representation 
• Event Tracking 
• Future Projects
Introduction
What questions we’ll try to answer? 
• Where to get global media data? 
• What is extractable from media documents? 
• How to connect information across languages? 
• What is an event? 
• How to approach diversity in news reporting? 
• How to visualize global event dynamics?
Systems/Demos used within the presentation 
• NewsFeed (http://newsfeed.ijs.si/) 
• News and social media crawler 
• Enrycher (http://enrycher.ijs.si/) 
• Language and Semantic annotation 
• XLing (http://xling.ijs.si/ 
• Cross-lingual document linking and categorization 
• DiversiNews (http://aidemo.ijs.si/diversinews/) 
• News Diversity Explorer 
• Event Registry (http://eventregistry.org/) 
• Event detection and topic tracking
The overall goal 
• The goal is to establish a real-time system 
• …to collect data from global media in real-time 
• …to identify events and track evolving topics 
• …to assign stable identifiers to events 
• …to identify events across languages 
• …to detect diversity of reporting along several dimensions 
• …to provide rich exploratory visualizations 
• …to provide interoperable data export
Main 
stream 
news 
Blogs 
Global Media Monitoring pipeline 
Article semantic 
annotation 
Cross-lingual 
article matching 
Cross-lingual 
cluster matching Event registry 
Event formation 
API Interface 
Event info. 
extraction 
Input 
data 
Pre-processing 
steps 
Event 
construction 
Event storage & 
maintenance 
Extraction of 
date references 
Article clustering 
Identifying 
related events 
Detection of 
article duplicates 
GUI/Visualizations 
http://EventRegistry.org
Collecting Media Data 
http://newsfeed.ijs.si/
Where to get references to news publishers? 
• Good start is Wikipedia list of newspapers: 
• http://en.wikipedia.org/wiki/Lists_of_newspapers
From a newspaper home-page to an article 
http://www.nytimes.com/ HTML 
RSS Feed 
(list of articles) 
Article to be retrieved
Collecting global media data 
• Data collection service News-Feed 
• http://newsfeed.ijs.si/ 
• …crawling global main-stream and social media 
• Monitoring 
• ~60k main-stream publishers (RSS feeds+special feeds) 
• ~250k most influential blogs (RSS feeds) 
• free Twitter feed 
• Data volume: ~350k articles & blogs per day (+5M tweets) 
• Languages: eng (50%), ger (10%), spa (8%), fra (5%)
Downloading the news stream (1/2) 
• The stream is accessible at http://newsfeed.ijs.si/stream/ 
• To download the whole stream continuously, you can use the python 
script (http://newsfeed.ijs.si/http2fs.py) 
• The script does the following:
Downloading the news 
stream (2/2) 
• News Stream Contents and Format 
• The root element, <article-set>, contains zero 
or more articles in the following XML format: 
• …more details: 
• Trampus, Mitja and Novak, Blaz: The Internals 
Of An Aggregated Web News Feed. 
Proceedings of 15th Multiconference on 
Information Society 2012 (IS-2012). [PDF]
Document Enrichment 
http://enrycher.ijs.si/
What can extracted from a document? 
• Lexical level 
• Tokenization – extracting tokens from a document (words, separators, …) 
• Sentence splitting – set of sentences to be further processed 
• Linguistic level 
• Part-of-Speech – assigning word types (nouns, verbs, adjectives, …) 
• Deep Parsing – constructing parse trees from sentences 
• Triple extraction – subject-predicate-object triple extraction 
• Name entity extraction – identifying names of people, places, organizations 
• Semantic level 
• Co-reference resolution – replacing pronouns with corresponding names; 
merging different surface forms of names into single entity 
• Semantic labeling – assigning semantic identifiers to names (e.g. 
LOD/DBpedia/Freebase) including disambiguation 
• Topic classification – assigning topic categories to a document (e.g. DMoz) 
• Summarization – assigning importance to parts of a document 
• Fact extraction – extracting relevant facts from a document
Enrycher (http://enrycher.ijs.si/) 
Plain text 
Extracted graph of triples from text 
Text 
Enrichment 
Diego Maradona Semantics: 
owl:sameAs: http://dbpedia.org/resource/Diego_Maradona 
owl:sameAs: http://sw.opencyc.org/concept/Mx4rvofERZwpEbGdrcN5Y29ycA 
rdf:type: http://dbpedia.org/class/yago/ArgentinaInternationalFootballers 
rdf:type: http://dbpedia.org/class/yago/ArgentineExpatriatesInItaly 
rdf:type: http://dbpedia.org/class/yago/ArgentineFootballManagers 
rdf:type: http://dbpedia.org/class/yago/ArgentineFootballers 
Robbie Keane Semantics: 
owl:sameAs: http://dbpedia.org/resource/Robbie_Keane 
rdf:type: http://dbpedia.org/class/yago/CoventryCityF.C.Players 
rdf:type: http://dbpedia.org/class/yago/ExpatriateFootballPlayersInItaly 
rdf:type: http://dbpedia.org/class/yago/F.C.InternazionaleMilanoPlayers 
“Enrycher” is available as 
as a web-service generating 
Semantic Graph, LOD links, 
Entities, Keywords, Categories, 
Text Summarization, Sentiment
Enrycher 
Architecture 
Plain text 
• Enrycher is a web service consisting 
of a set of interlinked modules… 
• …covering lexical, linguistic and 
semantic annotations 
• …exporting data in XML or RDF 
• To execute the service, one should 
send an HTTP POST request, with 
the raw text in the body: 
• curl -d “Enrycher was 
developed at JSI, a 
research institute in 
Ljubljana. Ljubljana is 
the capital of Slovenia.” 
http://enrycher.ijs.si/run 
Annotated document
Cross-linguality 
http://xling.ijs.si/
Cross-linguality 
How to operate in many languages? 
• Cross-linguality is a set of functions on how to transfer information 
across the languages 
• …having this, we can track information independent of the language borders 
• Machine Translation is expensive and slow, so the goal is to avoid machine 
translation to gain speed and scale 
• The key building block is the function for comparing and 
categorization of documents in different languages 
• XLing.ijs.si is an open web service to bridge information across 100 languages
Languages covered by XLing 
(top 100 Wikipedia languages)
XLing (XLing.ijs.si) 
service for comparing and categorization of documents across 100 languages 
Chinese 
Text 
English 
Text 
Automatically 
Extracted 
Keywords 
Automatically 
Extracted 
Keywords 
Similarity 
Between Two 
Documents 
Selection 
Of 100 
Languages
News Reporting Bias 
http://aidemo.ijs.si/diversinews/
News Reporting Bias example
Detecting News Reporting Bias 
• The task: 
• Given a news story, are we able to say from which news source it came? 
• We compared CNN and Aljazeera reports about the same events from 
the war in Iraq 
• …300 aligned articles describing the same story from both sources 
• The same topics are expressed in both sources with the following 
keywords: 
• CNN with: 
• Insurgents, Troops, Baghdad, Iran, Militant, Police, Suicide, Terrorist, United, National, 
Hussein, Alleged, Israeli, Syria, Terrorism… 
• Aljazeera with: 
• Attacks, Claims, Rebels, Withdrawing, Report, Fighters, President, Resistance, Occupation, 
Injured, Army, Demanded, Hit, Muslim, …
DiversiNews iPad App (1/2) 
• DiversiNews iPad App is using 
newsfeed.ijs.si and 
enrycher.ijs.si services 
• …in its initial screen is shows 
list of current hot topics and 
current trending events 
Hot Topics 
Trending 
Events
DiversiNews iPad App (2/2) 
• DiversiNews “diversity search” 
screen allows dynamic 
reranking of articles describing 
an event along three 
dimensions: 
• Geography – where is a content 
being published from 
• Subtopics – what are subtopics 
of an event 
• Sentiment – what are good and 
what are bad news 
• For each query it provides 
• Automatically generated 
summary 
• List of corresponding articles 
Geography 
Subtopics 
Sentiment 
Summary 
Articles
Event Representation 
http://eventregistry.org/
What is an event? 
(abstract description) 
• …more practical question: what definition of is computationally feasible? 
• In general, an event is something which “sticks out” of the average in some 
kind of (high dimensional) data space 
• …could be interpreted as an “anomaly” 
• …densification of data points (e.g. many similar documents) 
• …significant change of distribution (e.g. a trend on Twitter) 
• In practice, the event could be: 
• A cluster od documents / change of a distribution in data 
• Detected in an unsupervised way 
• A fit to a pre-built model 
• Detected in a supervised way
How to represent an event? 
• Baseline data for a news event is usually a cluster of documents 
• …with some preprocessing we extract linguistic and semantic annotations 
• …semantic annotations are linked to ontologies providing possibility for 
multiresolution annotations 
• Three levels of event representation: 
• Feature vector event representation: 
• …light weight representation that can be easily represented as a set of feature vectors 
augmented with external ontologies – suitable for scalable ML analysis 
• Structured event representation: 
• Infobox representation (slots filling) using open schema or event taxonomy 
• Deep event representation 
• Semantic representation linked to a world-model (e.g. CycKB common sense knowledge) 
– suitable for reasoning and diagnostics
Feature vector event representation 
• Feature vectors easily extractable from news documents: 
• Topical dimension – what is being talked about? (keywords) 
• Social dimension – which entities are mentioned? (named entities) 
• Temporal aspect – what is the time of an event? (temporal distribution) 
• Geographical aspect – where an event is taking place? (location) 
• Publisher aspect – who is reporting? (publisher identifiers) 
• Sentiment/bias aspect – emotional signals (numeric estimates) 
• Scalable Machine Learning techniques can easily deal with such 
representation 
• …in “Event Registry” system we use this representation to describe events
Example of “feature vector” event representation: Event Registry “Chicago” related events 
Where? 
(geography) 
When? 
(temporal 
distribution) 
Who? 
(named 
entities) 
What? 
(keyword/ 
topics) 
Query: 
“Chicago”
Structured event representation 
• Structured event representation describes an event 
by its “Event Type” and corresponding information 
slots to be filled 
• Event Types should be taken from “Event Taxonomy” 
• …at this stage of development this level of 
representation still requires human intervention to 
achieve high accuracy (Precision/Recall) extraction 
• Example on the right – Wikipedia event infobox: 
• 2011 Tōhoku earthquake and tsunami
“Event Taxonomy” – preview 
to the current development
Prototype for event Infobox extraction: 
XLike annotation service 
• The goal is to build a 
system for economically 
viable extraction of 
event infoboxes 
• …using crowd-sourcing 
• …aiming at high Precision 
& Recall for a small cost
Event sequences & Hierarchical events 
• Once having events identifies and represented we can connect events 
into “event sequences” (also called story-lines) 
• “Event sequences” include events which are supposedly related and 
constitute larger story 
• Collection of interrelated events can be also organized in hierarchies 
(e.g. World Cup event consists from a series of smaller events)
An example event: Microsoft Windows 9
Similar events example: similar events to 
Microsoft Windows 9 event
Event sequence identification
Hierarchy of events
Example 
Microsoft 
hierarchy of 
events
Zoom-in Example 
Microsoft 
hierarchy of events
Event Tracking 
http://eventregistry.org/
Live Event tracking with http://EventRegistry.org/
Event description through entities and Semantic keywords
Collection of events 
described through 
Entity relatedness
Collection of events 
described through 
trending concepts
Collection of events 
described through 
three level categorization
Events identified across languages
Collection of events 
described through 
Reporting dynamics
Collection of events 
described through 
a story-line of related events
Event Registry exports event data through API 
and RDF/Storyline ontology 
• API to search and export event information 
• Export of all the system data in JSON 
• Event data is exported in a structured form 
• BBC Storyline ontology 
• http://www.bbc.co.uk/ontologies/storyline/2013-05-01.html 
• SPARQL endpoint: 
• http://eventregistry.org/rdf/search 
• http://eventregistry.org/rdf/event/{eventID} 
• http://eventregistry.org/rdf/article/{articleD} 
• http://eventregistry.org/rdf/storyline/{storylineID} 
• Example: http://eventregistry.org/rdf/event/1234
Future / Follow-up projects
Some of the follow-up projects 
• Understanding global social dynamics 
• How global society functions? 
• Integrating text-based media with TV channels 
• …requires speech recognition, video processing, visual object recognition, 
face recognition, … 
• Event prediction / Event-Consequence prediction 
• …requires understanding of causality in the social dynamics and much more 
• Micro-reading / Machine-reading 
• …full understanding of individual documents – the goal for 10+ years

More Related Content

What's hot

Digital Libraries, Digital Archives, Digital Humanities, Digital Scholarship:...
Digital Libraries, Digital Archives, Digital Humanities, Digital Scholarship:...Digital Libraries, Digital Archives, Digital Humanities, Digital Scholarship:...
Digital Libraries, Digital Archives, Digital Humanities, Digital Scholarship:...Jenn Riley
 
What is digital humanities ? What's doing into English departments?
What is digital humanities ? What's doing into English departments?What is digital humanities ? What's doing into English departments?
What is digital humanities ? What's doing into English departments?gondasmita
 
MA in Digital Humanities
MA in Digital Humanities MA in Digital Humanities
MA in Digital Humanities Paul Spence
 
Vuorikari Multilingual Tagging behaviour by teachers
Vuorikari Multilingual Tagging behaviour by teachersVuorikari Multilingual Tagging behaviour by teachers
Vuorikari Multilingual Tagging behaviour by teachersRiina Vuorikari
 
State of Tools for NLP in Danish: 2018
State of Tools for NLP in Danish: 2018State of Tools for NLP in Danish: 2018
State of Tools for NLP in Danish: 2018Leon Derczynski
 
Introduction to nlp
Introduction to nlpIntroduction to nlp
Introduction to nlpAmaan Shaikh
 
Semantic engagement handouts
Semantic engagement handoutsSemantic engagement handouts
Semantic engagement handoutsSTIinnsbruck
 
Zoss High-Level Text Analysis and Techniques
Zoss High-Level Text Analysis and TechniquesZoss High-Level Text Analysis and Techniques
Zoss High-Level Text Analysis and TechniquesDukeDigitalScholarship
 
Modern text mining – understanding a million comments in 60 minutes
Modern text mining – understanding a million comments in 60 minutesModern text mining – understanding a million comments in 60 minutes
Modern text mining – understanding a million comments in 60 minutesZOLLHOF - Tech Incubator
 

What's hot (10)

Digital Libraries, Digital Archives, Digital Humanities, Digital Scholarship:...
Digital Libraries, Digital Archives, Digital Humanities, Digital Scholarship:...Digital Libraries, Digital Archives, Digital Humanities, Digital Scholarship:...
Digital Libraries, Digital Archives, Digital Humanities, Digital Scholarship:...
 
What is digital humanities ? What's doing into English departments?
What is digital humanities ? What's doing into English departments?What is digital humanities ? What's doing into English departments?
What is digital humanities ? What's doing into English departments?
 
MA in Digital Humanities
MA in Digital Humanities MA in Digital Humanities
MA in Digital Humanities
 
Vuorikari Multilingual Tagging behaviour by teachers
Vuorikari Multilingual Tagging behaviour by teachersVuorikari Multilingual Tagging behaviour by teachers
Vuorikari Multilingual Tagging behaviour by teachers
 
State of Tools for NLP in Danish: 2018
State of Tools for NLP in Danish: 2018State of Tools for NLP in Danish: 2018
State of Tools for NLP in Danish: 2018
 
Introduction to nlp
Introduction to nlpIntroduction to nlp
Introduction to nlp
 
Semantic engagement handouts
Semantic engagement handoutsSemantic engagement handouts
Semantic engagement handouts
 
Zoss High-Level Text Analysis and Techniques
Zoss High-Level Text Analysis and TechniquesZoss High-Level Text Analysis and Techniques
Zoss High-Level Text Analysis and Techniques
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Modern text mining – understanding a million comments in 60 minutes
Modern text mining – understanding a million comments in 60 minutesModern text mining – understanding a million comments in 60 minutes
Modern text mining – understanding a million comments in 60 minutes
 

Viewers also liked

Self-sourced Software_Catalysts For Change Zone of Future Innovtion
Self-sourced Software_Catalysts For Change Zone of Future InnovtionSelf-sourced Software_Catalysts For Change Zone of Future Innovtion
Self-sourced Software_Catalysts For Change Zone of Future InnovtionInstitute for the Future
 
Gameful Cities_Catalysts For Change Zone of Future Innovtion
Gameful Cities_Catalysts For Change Zone of Future InnovtionGameful Cities_Catalysts For Change Zone of Future Innovtion
Gameful Cities_Catalysts For Change Zone of Future InnovtionInstitute for the Future
 
Short Workshop on Transforming Ideas to Intellectual Property
Short Workshop on Transforming Ideas to Intellectual PropertyShort Workshop on Transforming Ideas to Intellectual Property
Short Workshop on Transforming Ideas to Intellectual PropertyInnomantra
 
EdVard munch presentazione
EdVard munch presentazioneEdVard munch presentazione
EdVard munch presentazioneserepisa
 
The 7 most important marketing/CRM trends of 2014 and what to do with them
The 7 most important marketing/CRM trends of 2014 and what to do with themThe 7 most important marketing/CRM trends of 2014 and what to do with them
The 7 most important marketing/CRM trends of 2014 and what to do with themArild Horsberg/Bring Dialog
 
Digtal media task guide
Digtal media task  guideDigtal media task  guide
Digtal media task guidegreg robertson
 
Half Day Signal Sharing Workshop Process and Templates
Half Day Signal Sharing Workshop Process and TemplatesHalf Day Signal Sharing Workshop Process and Templates
Half Day Signal Sharing Workshop Process and TemplatesInstitute for the Future
 
Kleenexpod - A new way to keep you clean
Kleenexpod - A new way to keep you cleanKleenexpod - A new way to keep you clean
Kleenexpod - A new way to keep you cleanAntonio Caamaño
 
Strategische inzet ict 100610
Strategische inzet ict 100610Strategische inzet ict 100610
Strategische inzet ict 100610Cees Corstanje
 
ALICT opening presentation 9.10.12, African Union Headquarters
ALICT opening presentation 9.10.12, African Union HeadquartersALICT opening presentation 9.10.12, African Union Headquarters
ALICT opening presentation 9.10.12, African Union HeadquartersInstitute for the Future
 
Economies of Time_Catalysts For Change Zone of Future Innovtion
Economies of Time_Catalysts For Change Zone of Future InnovtionEconomies of Time_Catalysts For Change Zone of Future Innovtion
Economies of Time_Catalysts For Change Zone of Future InnovtionInstitute for the Future
 
Entertainmentofthe80s
Entertainmentofthe80sEntertainmentofthe80s
Entertainmentofthe80sarmadilloxx
 

Viewers also liked (20)

Self-sourced Software_Catalysts For Change Zone of Future Innovtion
Self-sourced Software_Catalysts For Change Zone of Future InnovtionSelf-sourced Software_Catalysts For Change Zone of Future Innovtion
Self-sourced Software_Catalysts For Change Zone of Future Innovtion
 
A&m assessment help
A&m assessment helpA&m assessment help
A&m assessment help
 
Olivia 2009
Olivia 2009Olivia 2009
Olivia 2009
 
Gameful Cities_Catalysts For Change Zone of Future Innovtion
Gameful Cities_Catalysts For Change Zone of Future InnovtionGameful Cities_Catalysts For Change Zone of Future Innovtion
Gameful Cities_Catalysts For Change Zone of Future Innovtion
 
Short Workshop on Transforming Ideas to Intellectual Property
Short Workshop on Transforming Ideas to Intellectual PropertyShort Workshop on Transforming Ideas to Intellectual Property
Short Workshop on Transforming Ideas to Intellectual Property
 
3d Views Portfolio
3d Views Portfolio3d Views Portfolio
3d Views Portfolio
 
EdVard munch presentazione
EdVard munch presentazioneEdVard munch presentazione
EdVard munch presentazione
 
The 7 most important marketing/CRM trends of 2014 and what to do with them
The 7 most important marketing/CRM trends of 2014 and what to do with themThe 7 most important marketing/CRM trends of 2014 and what to do with them
The 7 most important marketing/CRM trends of 2014 and what to do with them
 
Digtal media task guide
Digtal media task  guideDigtal media task  guide
Digtal media task guide
 
Half Day Signal Sharing Workshop Process and Templates
Half Day Signal Sharing Workshop Process and TemplatesHalf Day Signal Sharing Workshop Process and Templates
Half Day Signal Sharing Workshop Process and Templates
 
Kleenexpod - A new way to keep you clean
Kleenexpod - A new way to keep you cleanKleenexpod - A new way to keep you clean
Kleenexpod - A new way to keep you clean
 
Issues in the conservation of photographs
Issues in the conservation of photographsIssues in the conservation of photographs
Issues in the conservation of photographs
 
Strategische inzet ict 100610
Strategische inzet ict 100610Strategische inzet ict 100610
Strategische inzet ict 100610
 
¿Querer es poder?
¿Querer es poder?¿Querer es poder?
¿Querer es poder?
 
ALICT opening presentation 9.10.12, African Union Headquarters
ALICT opening presentation 9.10.12, African Union HeadquartersALICT opening presentation 9.10.12, African Union Headquarters
ALICT opening presentation 9.10.12, African Union Headquarters
 
EzMate 401 Arise Biotech
EzMate 401 Arise BiotechEzMate 401 Arise Biotech
EzMate 401 Arise Biotech
 
Economies of Time_Catalysts For Change Zone of Future Innovtion
Economies of Time_Catalysts For Change Zone of Future InnovtionEconomies of Time_Catalysts For Change Zone of Future Innovtion
Economies of Time_Catalysts For Change Zone of Future Innovtion
 
Entertainmentofthe80s
Entertainmentofthe80sEntertainmentofthe80s
Entertainmentofthe80s
 
Allenamente
AllenamenteAllenamente
Allenamente
 
Emotion Mapping
Emotion MappingEmotion Mapping
Emotion Mapping
 

Similar to Global Media Monitor - Marko Grobelnik

Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014
Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014
Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014eswcsummerschool
 
Breaking Down Walls in Enterprise with Social Semantics
Breaking Down Walls in Enterprise with Social SemanticsBreaking Down Walls in Enterprise with Social Semantics
Breaking Down Walls in Enterprise with Social SemanticsJohn Breslin
 
Cultural Heritage Insitutions and Big Data Collections
Cultural Heritage Insitutions and Big Data CollectionsCultural Heritage Insitutions and Big Data Collections
Cultural Heritage Insitutions and Big Data Collectionslljohnston
 
Wikibhasha by Dr A Kumaran
Wikibhasha by  Dr A KumaranWikibhasha by  Dr A Kumaran
Wikibhasha by Dr A KumaranNIFT
 
IIIF for CNI Spring 2014 Membership Meeting
IIIF for CNI Spring 2014 Membership MeetingIIIF for CNI Spring 2014 Membership Meeting
IIIF for CNI Spring 2014 Membership MeetingTom-Cramer
 
Union catalogandknowledge engineering for teldap
Union catalogandknowledge engineering for teldapUnion catalogandknowledge engineering for teldap
Union catalogandknowledge engineering for teldapAAT Taiwan
 
How IKANOW uses MongoDB to help organizations solve really big problems
How IKANOW uses MongoDB to help organizations solve really big problemsHow IKANOW uses MongoDB to help organizations solve really big problems
How IKANOW uses MongoDB to help organizations solve really big problemsikanow
 
What do we want computers to do for us?
What do we want computers to do for us? What do we want computers to do for us?
What do we want computers to do for us? Andrea Volpini
 
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & MuseumsALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & MuseumsJon Voss
 
BiographyNet: Linking the world of History
BiographyNet: Linking the world of HistoryBiographyNet: Linking the world of History
BiographyNet: Linking the world of HistoryBiographyNet
 
Linked Open Data in Romania
Linked Open Data in RomaniaLinked Open Data in Romania
Linked Open Data in RomaniaVlad Posea
 
Leslie Johnston Keynote, Best Practices Exchange 2011
Leslie Johnston Keynote, Best Practices Exchange 2011Leslie Johnston Keynote, Best Practices Exchange 2011
Leslie Johnston Keynote, Best Practices Exchange 2011lljohnston
 
Bringing semantic publishing into TEI: ideas and pointers
Bringing semantic publishing into TEI: ideas and pointersBringing semantic publishing into TEI: ideas and pointers
Bringing semantic publishing into TEI: ideas and pointersUniversity of Bologna
 
From Digital Records to Digital Cultural Landscapes. Beyond Digital Library b...
From Digital Records to Digital Cultural Landscapes. Beyond Digital Library b...From Digital Records to Digital Cultural Landscapes. Beyond Digital Library b...
From Digital Records to Digital Cultural Landscapes. Beyond Digital Library b...4Science
 
HIBERLINK: Reference Rot and Linked Data: Threat and Remedy
HIBERLINK: Reference Rot and Linked Data: Threat and RemedyHIBERLINK: Reference Rot and Linked Data: Threat and Remedy
HIBERLINK: Reference Rot and Linked Data: Threat and RemedyPRELIDA Project
 
Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Anja Jentzsch
 
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012lljohnston
 

Similar to Global Media Monitor - Marko Grobelnik (20)

Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014
Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014
Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014
 
Breaking Down Walls in Enterprise with Social Semantics
Breaking Down Walls in Enterprise with Social SemanticsBreaking Down Walls in Enterprise with Social Semantics
Breaking Down Walls in Enterprise with Social Semantics
 
Cultural Heritage Insitutions and Big Data Collections
Cultural Heritage Insitutions and Big Data CollectionsCultural Heritage Insitutions and Big Data Collections
Cultural Heritage Insitutions and Big Data Collections
 
Wikibhasha by Dr A Kumaran
Wikibhasha by  Dr A KumaranWikibhasha by  Dr A Kumaran
Wikibhasha by Dr A Kumaran
 
IIIF for CNI Spring 2014 Membership Meeting
IIIF for CNI Spring 2014 Membership MeetingIIIF for CNI Spring 2014 Membership Meeting
IIIF for CNI Spring 2014 Membership Meeting
 
Union catalogandknowledge engineering for teldap
Union catalogandknowledge engineering for teldapUnion catalogandknowledge engineering for teldap
Union catalogandknowledge engineering for teldap
 
How IKANOW uses MongoDB to help organizations solve really big problems
How IKANOW uses MongoDB to help organizations solve really big problemsHow IKANOW uses MongoDB to help organizations solve really big problems
How IKANOW uses MongoDB to help organizations solve really big problems
 
Digital Humanities Workshop
Digital Humanities WorkshopDigital Humanities Workshop
Digital Humanities Workshop
 
What do we want computers to do for us?
What do we want computers to do for us? What do we want computers to do for us?
What do we want computers to do for us?
 
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & MuseumsALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
 
BiographyNet: Linking the world of History
BiographyNet: Linking the world of HistoryBiographyNet: Linking the world of History
BiographyNet: Linking the world of History
 
Linked Open Data in Romania
Linked Open Data in RomaniaLinked Open Data in Romania
Linked Open Data in Romania
 
Linked Open Data
Linked Open DataLinked Open Data
Linked Open Data
 
Leslie Johnston Keynote, Best Practices Exchange 2011
Leslie Johnston Keynote, Best Practices Exchange 2011Leslie Johnston Keynote, Best Practices Exchange 2011
Leslie Johnston Keynote, Best Practices Exchange 2011
 
Bringing semantic publishing into TEI: ideas and pointers
Bringing semantic publishing into TEI: ideas and pointersBringing semantic publishing into TEI: ideas and pointers
Bringing semantic publishing into TEI: ideas and pointers
 
Linked (Open) Data
Linked (Open) DataLinked (Open) Data
Linked (Open) Data
 
From Digital Records to Digital Cultural Landscapes. Beyond Digital Library b...
From Digital Records to Digital Cultural Landscapes. Beyond Digital Library b...From Digital Records to Digital Cultural Landscapes. Beyond Digital Library b...
From Digital Records to Digital Cultural Landscapes. Beyond Digital Library b...
 
HIBERLINK: Reference Rot and Linked Data: Threat and Remedy
HIBERLINK: Reference Rot and Linked Data: Threat and RemedyHIBERLINK: Reference Rot and Linked Data: Threat and Remedy
HIBERLINK: Reference Rot and Linked Data: Threat and Remedy
 
Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)
 
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
 

Recently uploaded

04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookmanojkuma9823
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 

Recently uploaded (20)

04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Bookvip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
vip Sarai Rohilla Call Girls 9999965857 Call or WhatsApp Now Book
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 

Global Media Monitor - Marko Grobelnik

  • 1. Global Media Monitoring http://eventregistry.org/ Marko Grobelnik Marko.Grobelnik@ijs.si Jozef Stefan Institute Ljubljana, Slovenia Contributions from Gregor Leban, Blaz Fortuna, Janez Brank, Jan Rupnik, Andrej Muhic, Aljaz Kosmerlj ESWC Summer School, Sep 2nd 2014, Kalamaki
  • 2. Outline • Introduction • Collecting Media Data • Document Enrichment • Cross-linguality • News Reporting Bias • Event Representation • Event Tracking • Future Projects
  • 4. What questions we’ll try to answer? • Where to get global media data? • What is extractable from media documents? • How to connect information across languages? • What is an event? • How to approach diversity in news reporting? • How to visualize global event dynamics?
  • 5. Systems/Demos used within the presentation • NewsFeed (http://newsfeed.ijs.si/) • News and social media crawler • Enrycher (http://enrycher.ijs.si/) • Language and Semantic annotation • XLing (http://xling.ijs.si/ • Cross-lingual document linking and categorization • DiversiNews (http://aidemo.ijs.si/diversinews/) • News Diversity Explorer • Event Registry (http://eventregistry.org/) • Event detection and topic tracking
  • 6. The overall goal • The goal is to establish a real-time system • …to collect data from global media in real-time • …to identify events and track evolving topics • …to assign stable identifiers to events • …to identify events across languages • …to detect diversity of reporting along several dimensions • …to provide rich exploratory visualizations • …to provide interoperable data export
  • 7. Main stream news Blogs Global Media Monitoring pipeline Article semantic annotation Cross-lingual article matching Cross-lingual cluster matching Event registry Event formation API Interface Event info. extraction Input data Pre-processing steps Event construction Event storage & maintenance Extraction of date references Article clustering Identifying related events Detection of article duplicates GUI/Visualizations http://EventRegistry.org
  • 8. Collecting Media Data http://newsfeed.ijs.si/
  • 9. Where to get references to news publishers? • Good start is Wikipedia list of newspapers: • http://en.wikipedia.org/wiki/Lists_of_newspapers
  • 10. From a newspaper home-page to an article http://www.nytimes.com/ HTML RSS Feed (list of articles) Article to be retrieved
  • 11. Collecting global media data • Data collection service News-Feed • http://newsfeed.ijs.si/ • …crawling global main-stream and social media • Monitoring • ~60k main-stream publishers (RSS feeds+special feeds) • ~250k most influential blogs (RSS feeds) • free Twitter feed • Data volume: ~350k articles & blogs per day (+5M tweets) • Languages: eng (50%), ger (10%), spa (8%), fra (5%)
  • 12. Downloading the news stream (1/2) • The stream is accessible at http://newsfeed.ijs.si/stream/ • To download the whole stream continuously, you can use the python script (http://newsfeed.ijs.si/http2fs.py) • The script does the following:
  • 13. Downloading the news stream (2/2) • News Stream Contents and Format • The root element, <article-set>, contains zero or more articles in the following XML format: • …more details: • Trampus, Mitja and Novak, Blaz: The Internals Of An Aggregated Web News Feed. Proceedings of 15th Multiconference on Information Society 2012 (IS-2012). [PDF]
  • 15. What can extracted from a document? • Lexical level • Tokenization – extracting tokens from a document (words, separators, …) • Sentence splitting – set of sentences to be further processed • Linguistic level • Part-of-Speech – assigning word types (nouns, verbs, adjectives, …) • Deep Parsing – constructing parse trees from sentences • Triple extraction – subject-predicate-object triple extraction • Name entity extraction – identifying names of people, places, organizations • Semantic level • Co-reference resolution – replacing pronouns with corresponding names; merging different surface forms of names into single entity • Semantic labeling – assigning semantic identifiers to names (e.g. LOD/DBpedia/Freebase) including disambiguation • Topic classification – assigning topic categories to a document (e.g. DMoz) • Summarization – assigning importance to parts of a document • Fact extraction – extracting relevant facts from a document
  • 16. Enrycher (http://enrycher.ijs.si/) Plain text Extracted graph of triples from text Text Enrichment Diego Maradona Semantics: owl:sameAs: http://dbpedia.org/resource/Diego_Maradona owl:sameAs: http://sw.opencyc.org/concept/Mx4rvofERZwpEbGdrcN5Y29ycA rdf:type: http://dbpedia.org/class/yago/ArgentinaInternationalFootballers rdf:type: http://dbpedia.org/class/yago/ArgentineExpatriatesInItaly rdf:type: http://dbpedia.org/class/yago/ArgentineFootballManagers rdf:type: http://dbpedia.org/class/yago/ArgentineFootballers Robbie Keane Semantics: owl:sameAs: http://dbpedia.org/resource/Robbie_Keane rdf:type: http://dbpedia.org/class/yago/CoventryCityF.C.Players rdf:type: http://dbpedia.org/class/yago/ExpatriateFootballPlayersInItaly rdf:type: http://dbpedia.org/class/yago/F.C.InternazionaleMilanoPlayers “Enrycher” is available as as a web-service generating Semantic Graph, LOD links, Entities, Keywords, Categories, Text Summarization, Sentiment
  • 17. Enrycher Architecture Plain text • Enrycher is a web service consisting of a set of interlinked modules… • …covering lexical, linguistic and semantic annotations • …exporting data in XML or RDF • To execute the service, one should send an HTTP POST request, with the raw text in the body: • curl -d “Enrycher was developed at JSI, a research institute in Ljubljana. Ljubljana is the capital of Slovenia.” http://enrycher.ijs.si/run Annotated document
  • 19. Cross-linguality How to operate in many languages? • Cross-linguality is a set of functions on how to transfer information across the languages • …having this, we can track information independent of the language borders • Machine Translation is expensive and slow, so the goal is to avoid machine translation to gain speed and scale • The key building block is the function for comparing and categorization of documents in different languages • XLing.ijs.si is an open web service to bridge information across 100 languages
  • 20. Languages covered by XLing (top 100 Wikipedia languages)
  • 21. XLing (XLing.ijs.si) service for comparing and categorization of documents across 100 languages Chinese Text English Text Automatically Extracted Keywords Automatically Extracted Keywords Similarity Between Two Documents Selection Of 100 Languages
  • 22. News Reporting Bias http://aidemo.ijs.si/diversinews/
  • 24. Detecting News Reporting Bias • The task: • Given a news story, are we able to say from which news source it came? • We compared CNN and Aljazeera reports about the same events from the war in Iraq • …300 aligned articles describing the same story from both sources • The same topics are expressed in both sources with the following keywords: • CNN with: • Insurgents, Troops, Baghdad, Iran, Militant, Police, Suicide, Terrorist, United, National, Hussein, Alleged, Israeli, Syria, Terrorism… • Aljazeera with: • Attacks, Claims, Rebels, Withdrawing, Report, Fighters, President, Resistance, Occupation, Injured, Army, Demanded, Hit, Muslim, …
  • 25. DiversiNews iPad App (1/2) • DiversiNews iPad App is using newsfeed.ijs.si and enrycher.ijs.si services • …in its initial screen is shows list of current hot topics and current trending events Hot Topics Trending Events
  • 26. DiversiNews iPad App (2/2) • DiversiNews “diversity search” screen allows dynamic reranking of articles describing an event along three dimensions: • Geography – where is a content being published from • Subtopics – what are subtopics of an event • Sentiment – what are good and what are bad news • For each query it provides • Automatically generated summary • List of corresponding articles Geography Subtopics Sentiment Summary Articles
  • 28. What is an event? (abstract description) • …more practical question: what definition of is computationally feasible? • In general, an event is something which “sticks out” of the average in some kind of (high dimensional) data space • …could be interpreted as an “anomaly” • …densification of data points (e.g. many similar documents) • …significant change of distribution (e.g. a trend on Twitter) • In practice, the event could be: • A cluster od documents / change of a distribution in data • Detected in an unsupervised way • A fit to a pre-built model • Detected in a supervised way
  • 29. How to represent an event? • Baseline data for a news event is usually a cluster of documents • …with some preprocessing we extract linguistic and semantic annotations • …semantic annotations are linked to ontologies providing possibility for multiresolution annotations • Three levels of event representation: • Feature vector event representation: • …light weight representation that can be easily represented as a set of feature vectors augmented with external ontologies – suitable for scalable ML analysis • Structured event representation: • Infobox representation (slots filling) using open schema or event taxonomy • Deep event representation • Semantic representation linked to a world-model (e.g. CycKB common sense knowledge) – suitable for reasoning and diagnostics
  • 30. Feature vector event representation • Feature vectors easily extractable from news documents: • Topical dimension – what is being talked about? (keywords) • Social dimension – which entities are mentioned? (named entities) • Temporal aspect – what is the time of an event? (temporal distribution) • Geographical aspect – where an event is taking place? (location) • Publisher aspect – who is reporting? (publisher identifiers) • Sentiment/bias aspect – emotional signals (numeric estimates) • Scalable Machine Learning techniques can easily deal with such representation • …in “Event Registry” system we use this representation to describe events
  • 31. Example of “feature vector” event representation: Event Registry “Chicago” related events Where? (geography) When? (temporal distribution) Who? (named entities) What? (keyword/ topics) Query: “Chicago”
  • 32. Structured event representation • Structured event representation describes an event by its “Event Type” and corresponding information slots to be filled • Event Types should be taken from “Event Taxonomy” • …at this stage of development this level of representation still requires human intervention to achieve high accuracy (Precision/Recall) extraction • Example on the right – Wikipedia event infobox: • 2011 Tōhoku earthquake and tsunami
  • 33. “Event Taxonomy” – preview to the current development
  • 34. Prototype for event Infobox extraction: XLike annotation service • The goal is to build a system for economically viable extraction of event infoboxes • …using crowd-sourcing • …aiming at high Precision & Recall for a small cost
  • 35. Event sequences & Hierarchical events • Once having events identifies and represented we can connect events into “event sequences” (also called story-lines) • “Event sequences” include events which are supposedly related and constitute larger story • Collection of interrelated events can be also organized in hierarchies (e.g. World Cup event consists from a series of smaller events)
  • 36. An example event: Microsoft Windows 9
  • 37. Similar events example: similar events to Microsoft Windows 9 event
  • 41. Zoom-in Example Microsoft hierarchy of events
  • 43. Live Event tracking with http://EventRegistry.org/
  • 44. Event description through entities and Semantic keywords
  • 45. Collection of events described through Entity relatedness
  • 46. Collection of events described through trending concepts
  • 47. Collection of events described through three level categorization
  • 49. Collection of events described through Reporting dynamics
  • 50. Collection of events described through a story-line of related events
  • 51. Event Registry exports event data through API and RDF/Storyline ontology • API to search and export event information • Export of all the system data in JSON • Event data is exported in a structured form • BBC Storyline ontology • http://www.bbc.co.uk/ontologies/storyline/2013-05-01.html • SPARQL endpoint: • http://eventregistry.org/rdf/search • http://eventregistry.org/rdf/event/{eventID} • http://eventregistry.org/rdf/article/{articleD} • http://eventregistry.org/rdf/storyline/{storylineID} • Example: http://eventregistry.org/rdf/event/1234
  • 52. Future / Follow-up projects
  • 53. Some of the follow-up projects • Understanding global social dynamics • How global society functions? • Integrating text-based media with TV channels • …requires speech recognition, video processing, visual object recognition, face recognition, … • Event prediction / Event-Consequence prediction • …requires understanding of causality in the social dynamics and much more • Micro-reading / Machine-reading • …full understanding of individual documents – the goal for 10+ years