Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

842 views

Published on

Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

  • Be the first to comment

Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014

  1. 1. Global Media Monitoring h0p://eventregistry.org/ Marko Grobelnik Jozef Stefan Ins4tute Ljubljana, Slovenia Contribu4ons from Gregor Leban, Blaz Fortuna, Janez Brank, Jan Rupnik, Andrej Muhic ESWC Summer School, Sep 2nd 2014, Kalamaki
  2. 2. Outline • Introduc4on • Collec4ng Media Data • Document Enrichment • Cross-­‐linguality • News Repor4ng Bias • Event Representa4on • Event Tracking • Future Projects
  3. 3. Introduc=on
  4. 4. What ques=ons we’ll try to answer? • Where to get global media data? • What is extractable from media documents? • How to connect informa4on across languages? • What is an event? • How to approach diversity in news repor4ng? • How to visualize global event dynamics?
  5. 5. Systems/Demos used within the presenta=on • NewsFeed (hWp://newsfeed.ijs.si/) • News and social media crawler • Enrycher (hWp://enrycher.ijs.si/) • Language and Seman4c annota4on • XLing (hWp://xling.ijs.si/ • Cross-­‐lingual document linking and categoriza4on • DiversiNews (hWp://aidemo.ijs.si/diversinews/) • News Diversity Explorer • Event Registry (hWp://eventregistry.org/) • Event detec4on and topic tracking
  6. 6. The overall goal • The goal is to establish a real-­‐4me system • …to collect data from global media in real-­‐4me • …to iden4fy events and track evolving topics • …to assign stable iden4fiers to events • …to iden4fy events across languages • …to detect diversity of repor4ng along several dimensions • …to provide rich exploratory visualiza4ons • …to provide interoperable data export
  7. 7. Main stream news Blogs Global Media Monitoring pipeline Ar4cle seman4c annota4on Cross-­‐lingual ar4cle matching Cross-­‐lingual cluster matching Event forma4on Event registry API Interface Event info. extrac4on Input data Pre-­‐processing steps Event construc4on Event storage & maintenance Extrac4on of date references Ar4cle clustering Iden4fying related events Detec4on of ar4cle duplicates GUI/Visualiza4ons hWp://EventRegistry.org
  8. 8. Collec=ng Media Data hWp://newsfeed.ijs.si/
  9. 9. Where to get references to news publishers? • Good start is Wikipedia list of newspapers: • hWp://en.wikipedia.org/wiki/Lists_of_newspapers
  10. 10. From a newspaper home-­‐page to an ar=cle hWp://www.ny4mes.com/ HTML RSS Feed (list of ar4cles) Ar4cle to be retreived
  11. 11. Collec=ng global media data • Data collec4on service News-­‐Feed • hWp://newsfeed.ijs.si/ • …crawling global main-­‐stream and social media • Monitoring • ~60k main-­‐stream publishers (RSS feeds+special feeds) • ~250k most influen4al blogs (RSS feeds) • free TwiWer feed • Data volume: ~350k ar4cles & blogs per day (+5M tweets) • Languages: eng (50%), ger (10%), spa (8%), fra (5%)
  12. 12. Downloading the news stream (1/2) • The stream is accessible at hWp://newsfeed.ijs.si/stream/ • To download the whole stream con4nuously, you can use the python script (hWp://newsfeed.ijs.si/hWp2fs.py) • The script does the following:
  13. 13. Downloading the news stream (2/2) • News Stream Contents and Format • The root element, <ar4cle-­‐set>, contains zero or more ar4cles in the following XML format: • …more details: • Trampus, Mitja and Novak, Blaz: The Internals Of An Aggregated Web News Feed. Proceedings of 15th Mul4conference on Informa4on Society 2012 (IS-­‐2012). [PDF]
  14. 14. Document Enrichment hWp://enrycher.ijs.si/
  15. 15. What can extracted from a document? • Lexical level • Tokeniza4on – extrac4ng tokens from a document (words, separators, …) • Sentence spli<ng – set of sentences to be further processed • Linguis4c level • Part-­‐of-­‐Speech – assigning word types (nouns, verbs, adjec4ves, …) • Deep Parsing – construc4ng parse trees from sentences • Triple extrac4on – subject-­‐predicate-­‐object triple extrac4on • Name en4ty extrac4on – iden4fying names of people, places, organiza4ons • Seman4c level • Co-­‐reference resolu4on – replacing pronouns with corresponding names; merging different surface forms of names into single en4ty • Seman4c labeling – assigning seman4c iden4fiers to names (e.g. LOD/DBpedia/ Freebase) including disambigua4on • Topic classifica4on – assigning topic categories to a document (e.g. DMoz) • Summariza4on – assigning importance to parts of a document • Fact extrac4on – extrac4ng relevant facts from a document
  16. 16. Enrycher (h0p://enrycher.ijs.si/) Plain text Extracted graph of triples from text Text Enrichment Diego Maradona Seman4cs: owl:sameAs: hKp://dbpedia.org/resource/Diego_Maradona owl:sameAs: hKp://sw.opencyc.org/concept/Mx4rvofERZwpEbGdrcN5Y29ycA rdf:type: hWp://dbpedia.org/class/yago/Argen4naInterna4onalFootballers rdf:type: hWp://dbpedia.org/class/yago/Argen4neExpatriatesInItaly rdf:type: hWp://dbpedia.org/class/yago/Argen4neFootballManagers rdf:type: hWp://dbpedia.org/class/yago/Argen4neFootballers Robbie Keane Seman4cs: owl:sameAs: hKp://dbpedia.org/resource/Robbie_Keane rdf:type: hWp://dbpedia.org/class/yago/CoventryCityF.C.Players rdf:type: hWp://dbpedia.org/class/yago/ExpatriateFootballPlayersInItaly rdf:type: hWp://dbpedia.org/class/yago/F.C.InternazionaleMilanoPlayers “Enrycher” is available as as a web-­‐service genera4ng Seman4c Graph, LOD links, En44es, Keywords, Categories, Text Summariza4on, Sen4ment
  17. 17. Enrycher Architecture • Enrycher Plain text is a web service consis4ng of a set of interlinked modules… • …covering lexical, linguis4c and seman4c annota4ons • …expor4ng data in XML or RDF • To execute the service, one should send an HTTP POST request, with the raw text in the body: • curl -d “Enrycher was developed at JSI, a research institute in Ljubljana. Ljubljana is the capital of Slovenia.” http://enrycher.ijs.si/run! Annotated document
  18. 18. Cross-­‐linguality hWp://xling.ijs.si/
  19. 19. Cross-­‐linguality How to operate in many languages? • Cross-­‐linguality is a set of func4ons on how to transfer informa4on across the languages • …having this, we can track informa4on independent of the language borders • Machine Transla4on is expensive and slow, so the goal is to avoid machine transla4on to gain speed and scale • The key building block is the func4on for comparing and categoriza4on of documents in different languages • XLing.ijs.si is an open web service to bridge informa4on across 100 languages
  20. 20. Languages covered by XLing (top 100 Wikipedia languages)
  21. 21. XLing (XLing.ijs.si) service for comparing and categoriza=on of documents across 100 languages Chinese Text English Text Automa4cally Extracted Keywords Automa4cally Extracted Keywords Similarity Between Two Documents Selec4on Of 100 Languages
  22. 22. News Repor=ng Bias hWp://aidemo.ijs.si/diversinews/
  23. 23. News Repor=ng Bias example
  24. 24. Detec=ng News Repor=ng Bias • The task: • Given a news story, are we able to say from which news source it came? • We compared CNN and Aljazeera reports about the same events from the war in Iraq • …300 aligned ar4cles describing the same story from both sources • The same topics are expressed in both sources with the following keywords: • CNN with: • Insurgents, Troops, Baghdad, Iran, Militant, Police, Suicide, Terrorist, United, Na4onal, Hussein, Alleged, Israeli, Syria, Terrorism… • Aljazeera with: • AWacks, Claims, Rebels, Withdrawing, Report, Fighters, President, Resistance, Occupa4on, Injured, Army, Demanded, Hit, Muslim, …
  25. 25. DiversiNews iPad App (1/2) • DiversiNews iPad App is using newsfeed.ijs.si and enrycher.ijs.si services • …in its ini4al screen is shows list of current hot topics and current trending events Hot Topics Trending Events
  26. 26. DiversiNews iPad App (2/2) • DiversiNews “diversity search” screen allows dynamic reranking of ar4cles describing an event along three dimensions: • Geography – where is a content being published from • Subtopics – what are subtopics of an event • Sen4ment – what are good and what are bad news • For each query it provides • Automa4cally generated summary • List of corresponding ar4cles Geography Subtopics Sen4ment Summary Ar4cles
  27. 27. Event Representa=on hWp://eventregistry.org/
  28. 28. What is an event? (abstract descrip=on) • …more prac4cal ques4on: what defini4on of is computa4onally feasible? • In general, an event is something which “s4cks out” of the average in some kind of (high dimensional) data space • …could be interpreted as an “anomaly” • …densifica4on of data points (e.g. many similar documents) • …significant change of distribu4on (e.g. a trend on TwiWer) • In prac4ce, the event could be: • A cluster od documents / change of a distribu4on in data • Detected in an unsupervised way • A fit to a pre-­‐built model • Detected in a supervised way
  29. 29. How to represent an event? • Baseline data for a news event is usually a cluster of documents • …with some preprocessing we extract linguis4c and seman4c annota4ons • …seman4c annota4ons are linked to ontologies providing possibility for mul4resolu4on annota4ons • Three levels of event representa4on: • Feature vector event representa4on: • …light weight representa4on that can be easily represented as a set of feature vectors augmented with external ontologies – suitable for scalable ML analysis • Structured event representa4on: • Infobox representa4on (slots filling) using open schema or event taxonomy • Deep event representa4on • Seman4c representa4on linked to a world-­‐model (e.g. CycKB common sense knowledge) – suitable for reasoning and diagnos4cs
  30. 30. Feature vector event representa=on • Feature vectors easily extractable from news documents: • Topical dimension – what is being talked about? (keywords) • Social dimension – which en44es are men4oned? (named en44es) • Temporal aspect – what is the 4me of an event? (temporal distribu4on) • Geographical aspect – where an event is taking place? (loca4on) • Publisher aspect – who is repor4ng? (publisher iden4fiers) • Sen4ment/bias aspect – emo4onal signals (numeric es4mates) • Scalable Machine Learning techniques can easily deal with such representa4on • …in “Event Registry” system we use this representa4on to describe events
  31. 31. Example of “feature vector” event representa=on: Event Registry “Chicago” related events Where? (geography) When? (temporal distribu4on) Who? (named en44es) What? (keyword/ topics) Query: “Chicago”
  32. 32. Structured event representa=on • Structured event representa4on describes an event by its “Event Type” and corresponding informa4on slots to be filled • Event Types should be taken from “Event Taxonomy” • …at this stage of development this level of representa4on s4ll requires human interven4on to achieve high accuracy (Precision/Recall) extrac4on • Example on the right – Wikipedia event infobox: • 2011 Tōhoku earthquake and tsunami
  33. 33. “Event Taxonomy” – preview to the current development
  34. 34. Prototype for event Infobox extrac=on: XLike annota=on service • The goal is to build a system for economically viable extrac4on of event infoboxes • …using crowd-­‐sourcing • …aiming at high Precision & Recall for a small cost
  35. 35. Event sequences & Hierarchical events • Once having events iden4fies and represented we can connect events into “event sequences” (also called story-­‐lines) • “Event sequences” include events which are supposedly related and cons4tute larger story • Collec4on of interrelated events can be also organized in hierarchies (e.g. World Cup event consists from a series of smaller events)
  36. 36. An example event: Microsoa Windows 9
  37. 37. Similar events example: similar events to Microsoa Windows 9 event
  38. 38. Event sequence iden=fica=on
  39. 39. Hierarchy of events
  40. 40. Example Microsoa hierarchy of events
  41. 41. Zoom-­‐in Example Microsoa hierarchy of events
  42. 42. Event Tracking hWp://eventregistry.org/
  43. 43. Live Event tracking with h0p://EventRegistry.org/
  44. 44. Event descrip4on through en44es and Seman4c keywords
  45. 45. Collec4on of events described through En4ty relatedness
  46. 46. Collec4on of events described through trending concepts
  47. 47. Collec4on of events described through three level categoriza4on
  48. 48. Events iden4fied across languages
  49. 49. Collec4on of events described through Repor4ng dynamics
  50. 50. Collec4on of events described through a story-­‐line of related events
  51. 51. Event Registry exports event data through API and RDF/Storyline ontology • API to search and export event informa4on • Export of all the system data in JSON • Event data is exported in a structured form • BBC Storyline ontology • hWp://www.bbc.co.uk/ontologies/storyline/2013-­‐05-­‐01.html • SPARQL endpoint: • hWp://eventregistry.org/rdf/search • hWp://eventregistry.org/rdf/event/{eventID} • hWp://eventregistry.org/rdf/ar4cle/{ar4cleD} • hWp://eventregistry.org/rdf/storyline/{storylineID} • Example: hWp://eventregistry.org/rdf/event/1234
  52. 52. Future / Follow-­‐up projects
  53. 53. Some of the follow-­‐up projects • Understanding global social dynamics • How global society func4ons? • Integra4ng text-­‐based media with TV channels • …requires speech recogni4on, video processing, visual object recogni4on, face recogni4on, … • Event predic4on / Event-­‐Consequence predic4on • …requires understanding of causality in the social dynamics and much more • Micro-­‐reading / Machine-­‐reading • …full understanding of individual documents – the goal for 10+ years

×