Annotating streams of heterogeneous data for topic generation


Talk given at the VU University Amsterdam, NL - February 6, 2013

Abstract: Since the advent of Linked Data, we have observed a dramatic increase in structured data sources published on the Web. They mainly provide entity-to-entity interconnections, resulting in a Web of Linked Entities, disambiguated through URIs, spanning structured and unstructured data. Several efforts have been made to exploit this mine of information to enhance text understanding by connecting pieces of text to real-world objects, i.e. entities, that are easily discoverable by intelligent agents. The result is a proliferation of different systems for annotating text with "Web" entities.

With this in mind, we have developed a framework for harmonizing access to such systems and their output. The NERD ontology [1] aligns the differences among the annotations and provides definitions for a set of axioms taken from the long-tail distribution of classes common to the extractors used. On top of the NERD ontology, we have developed NERD [2], which implements a combined logic that seeks to minimize annotation error by taking the best, when possible, from these extractors. We have observed that the well-known entity classes, such as Person, Location, and Organization, are well covered by these extractors, while Event is less so, mainly because of a lack of definition and knowledge about what events are. As a follow-up to the Eventmedia project [3], we are defining an event spotter which takes advantage of the large event knowledge graph described in the Eventmedia dataset [4].
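
The combined logic can be illustrated with a minimal sketch. It assumes that each extractor returns annotations shaped like the tuples shown on the slides, (e, t, URI, si, ei), and that si and ei are start and end offsets, which the slides do not state. The preferred order is the one listed on slide 9; the `Annotation` class, the `fuse` function, and the sample data are hypothetical and are not taken from the NERD code base.

```python
from dataclasses import dataclass

# Preferred selection order as listed on slide 9 of this talk.
PREFERRED_ORDER = ["wikimeta", "alchemyapi", "opencalais", "lupedia"]


@dataclass(frozen=True)
class Annotation:
    """One extractor output (hypothetical structure, field meanings assumed)."""
    extractor: str
    entity: str     # "e": the spotted text
    nerd_type: str  # "t": the type assigned to the entity
    uri: str        # "URI": disambiguation pointer, may be empty
    start: int      # "si": assumed to be the start offset
    end: int        # "ei": assumed to be the end offset


def fuse(annotations):
    """Keep one annotation per text span, preferring earlier extractors."""
    rank = {name: i for i, name in enumerate(PREFERRED_ORDER)}
    unknown = len(rank)  # extractors outside the list rank last
    best = {}
    for ann in annotations:
        key = (ann.start, ann.end)
        current = best.get(key)
        if current is None or (
            rank.get(ann.extractor, unknown) < rank.get(current.extractor, unknown)
        ):
            best[key] = ann
    return [best[key] for key in sorted(best)]


if __name__ == "__main__":
    # Made-up annotations, for illustration only.
    sample = [
        Annotation("lupedia", "Paris", "Location", "", 0, 5),
        Annotation("wikimeta", "Paris", "City", "http://dbpedia.org/resource/Paris", 0, 5),
        Annotation("opencalais", "Amsterdam", "City", "", 10, 19),
    ]
    for ann in fuse(sample):
        print(ann.extractor, ann.entity, ann.nerd_type)
```

A fixed priority list is simple to reason about, but as the evaluation slides point out, this kind of rule can introduce systematic errors.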

Social platforms are also sources of structured and unstructured data. They constantly record streams of heterogeneous data about human activities, feelings, emotions, and conversations, opening a window on the world in real time. Making sense of these streams is extremely challenging. We are currently investigating the role of named entities as centroids for micropost topic generation, presenting them through visual galleries.
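
The idea of entities as centroids can be sketched in a few lines. The example assumes that each micropost has already been annotated with entity labels; the function name, the input layout, and the sample posts are hypothetical. Each entity becomes a topic, and a micropost can belong to several topics, as described on slide 19.

```python
from collections import defaultdict


def group_by_entity(microposts):
    """Map each entity label to the ids of the microposts that mention it.

    `microposts` maps a post id to the list of entity labels found in it
    (assumed to come from an earlier annotation step).
    """
    topics = defaultdict(list)
    for post_id, entities in microposts.items():
        # A set avoids listing the same post twice under one topic.
        for entity in sorted(set(entities)):
            topics[entity].append(post_id)
    return dict(topics)


if __name__ == "__main__":
    # Made-up annotated microposts, for illustration only.
    posts = {
        "p1": ["Amsterdam", "VU University"],
        "p2": ["Amsterdam"],
        "p3": [],
    }
    for entity, members in group_by_entity(posts).items():
        print(entity, members)
```

Posts without any entity, such as `p3` here, end up in no topic; the other grouping signals listed on slide 18 (entity classes, LDA, density-based proximity) are not covered by this sketch.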

[1] -
[2] -
[3] -
[4] -

Published in: Education

Annotating streams of heterogeneous data for topic generation

  1. Annotating streams of heterogeneous data for topic generation
     Giuseppe Rizzo, @giusepperizzo
  2. Spotting entities while reading a document
     ➢ Names of people, locations, organizations, etc.
     ➢ Named entities are fundamental keys for topic understanding
     ➢ But the same location name can refer to different places
     February 6, 2013, VU University Amsterdam, NL
  3. A Web of Linked Entities
     ➢ GGG (global giant graph)
     ➢ Nodes are Web entities
     ➢ Entities provide disambiguation pointers
     ➢ Entities can be referred to unambiguously (disambiguated)
     ➢ Entities as centroids for topic generation and understanding
     source: http://wole2012.eurecom.fr
  4. Entity extractors
     (figure labels: Web API, disambiguation URI)
  5. Diversity
     Each entry lists: languages; granularity; entity position; classification schema; number of classes; response formats; quota (calls/day).
     – Alchemy API: EN, FR, DE, IT, PT, RU, SP, SW; OEN; N/A; Alchemy; 324; JSON, MicroFormat, XML, RDF; 30000
     – DBpedia Spotlight: EN; OEN; char offset; DBpedia, FreeBase, Schema.org; 320; HTML, JSON, RDF, XML; unlimited
     – Extractiv: EN; OEN; word offset; Extractiv; 34; HTML, JSON, RDF, XML; 3000
     – Lupedia: EN, FR, IT; OEN; range of chars; DBpedia, LinkedMDB; 319; HTML, JSON, RDFa, XML; unlimited
     – Open Calais: EN, FR, SP; OEN; char offset; Open Calais; 95; JSON, MicroFormat; 50000
     – Saplo: EN, SW; OED; N/A; Saplo; 5; JSON; 1333
     – Semi Tags: DE, NL; OED; char offset; CoNLL-3; 4; XML; unlimited
     – Wikimeta: EN, FR, SP; OEN; POS offset; ESTER; 7; JSON, XML; unlimited
     – Yahoo!: EN; OEN; range of chars; Yahoo; 13; JSON, XML; 5000
     – Zemanta: EN; OED; N/A; FreeBase; 81; XML, JSON, RDF; 10000
  6. Harmonizing annotations
     ➢ ontology
     ➢ REST API
     ➢ UI
     http://nerd.eurecom.fr
  7. NERD Ontology
     NERD type: occurrence
     – Person: 10
     – Organization: 10
     – Country: 6
     – Company: 6
     – Location: 6
     – Continent: 5
     – City: 5
     – RadioStation: 5
     – Album: 5
     – Product: 5
     – ...
     The NERD ontology has been integrated in the NIF project, an EU FP7 effort in the context of LOD2: Creating Knowledge out of Interlinked Data
  8. ETAPE2012
     ➢ DGA (French radio transcripts)
       – Train: 7h 50m
       – Dev: 3h
       – Eval: 3h
     ➢ ELDA (French TV transcripts)
       – Train: 18h 10m
       – Dev: 7h 55m
       – Eval: 7h 55m
     ➢ Annotation schema Quaero: 32 classes
  9. We can do better: combined (ETAPE 2012)
     ➢ Pipeline: extraction, cleaning, fusion
     ➢ Extractor outputs are tuples: (eA1,tA1,URIA1,siA1,eiA1), (eA2,tA2,URIA2,siA2,eiA2), (eA3,tA3,URIA3,siA3,eiA3), ..., (eN1,tN1,URIN1,siN1,eiN1), (eN2,tN2,URIN2,siN2,eiN2)
     ➢ When at least 2 extractors classify the same entity with a different type, we apply a preferred selection order (learning rules): Wikimeta, AlchemyAPI, OpenCalais, Lupedia
  10. … but it introduced systematic errors (ETAPE 2012)
      Columns: SLR (Slot Error Rate), precision, recall, F1, %correct
      – alchemyapi: 37.71%, 47.95%, 5.45%, 9.68%, 5.45%
      – lupedia: 39.49%, 22.87%, 1.56%, 2.91%, 1.56%
      – opencalais: 37.47%, 41.69%, 3.53%, 6.49%, 3.53%
      – wikimeta: 36.67%, 19.40%, 4.25%, 6.95%, 4.25%
      – combined (nerd): 86.85%, 35.31%, 17.69%, 23.44%, 17.69%
  11. Gazetteers: combined+ (ETAPE 2012)
      ➢ Diagram labels: POS tagger, learned model, created static rules, apply rules, fusion
      ➢ Extractor outputs (eA1,tA1,URIA1,siA1,eiA1), (eA2,tA2,URIA2,siA2,eiA2), ..., (eN1,tN1,URIN1,siN1,eiN1) are fused into (e1,t1,URI1,si1,ei1)
      ➢ Conflicts handled by priority selection: own, Wikimeta, AlchemyAPI, OpenCalais, Lupedia
  12. Over-estimated training model (ETAPE 2012)
      Columns: SLR (Slot Error Rate), precision, recall, F1, %correct
      – combined: 86.85%, 35.31%, 17.69%, 23.44%, 17.69%
      – combined+: 188.81%, 15.13%, 28.40%, 19.45%, 28.40%
  13. General NER limitations
      ➢ Performance drops
        – with common settings using off-the-shelf models, when annotating corpora that differ from the training data (empirically, recall drops by ~20%)
        – with noisy texts such as transcripts and microposts
      ➢ Lack of knowledge for particular categories, in particular Event
  14. Participation at the #MSM2013 challenge (ongoing)
      ➢ English Twitter posts
        – Train: 2815 posts
        – Eval: 1526 posts
      ➢ Annotation schema: 4 classes
      ➢ Objective: perform better than the Stanford CRF, properly trained with the challenge settings
      Provisional results of the Stanford CRF, 4-fold cross validation over the training set (columns: precision, recall, F1)
        – LOC: 80.12%, 57.76%, 67.63%
        – MISC: 68.18%, 31.51%, 43.10%
        – ORG: 83.28%, 50.71%, 63.04%
        – PER: 79.93%, 70.72%, 75.04%
  15. Poor performance in spotting events
      ➢ Exploit the large domain knowledge represented by the Eventmedia dataset
      ➢ EventSpotter
        – Entities classified according to the LODE ontology
        – Spotting according to the event name, agents, temporal and geospatial information
        – Confidence computed according to the similarity between the text surrounding the spotted entity and the event description
        – Disambiguation provided by the event URIs (nodes of the Eventmedia graph)
  16. Entities for concept mining
      ➢ Used to annotate textual data
        – news articles, and ...
      ➢ Video transcripts:
        – video segmentation (MediaFragment)
        – MediaFragment annotation
        – indexing
        – topic model generation
      ➢ Microposts:
        – text understanding
        – topic model generation
  17. Media Fragment Enricher
      Joint work between University of Southampton and EURECOM
  18. Annotating social streams
      ➢ Live and fresh breaking news: microposts
      ➢ Media items, such as pictures and videos, are usually attached to microposts
      ➢ Grouping microposts:
        – Entity labels
        – Entity classes
        – Latent Dirichlet allocation (LDA)
        – Density-based micropost proximity (text similarity, entity similarity, temporal distance)
      ➢ Create textual storyboards from vox populi
      ➢ Describe the created storyboards visually
  19. Centroids for topic generation
      ➢ Each cloud represents a topic
      ➢ A topic is depicted by an entity
      ➢ Leaves are media items, which visually represent the microposts
      ➢ Each leaf can belong to many topics
  20. Topic storyboard
      ➢ Visual summary of the topic
      ➢ Topic is labelled with an entity
      ➢ A poster picture is displayed according to the relevance of the micropost in the generated topic
      ➢ If the entity points to a LOD resource, a textual description is displayed
  21. Outlook
      ➢ Modelling heterogeneous data with entities
      ➢ Linking entities according to the topics extracted from the text
      ➢ Enhancing topic modelling with the GGG
      ➢ Providing visual storyboards tailored to the extracted topics
  22. Thanks for your time and attention
      Agenda:
        – Web of Linked Entities (sl. 3)
        – Aligning annotations (sl. 6)
        – Combining performances of 3rd-party entity extractors (sl. 9)
        – Spotting events (sl. 15)
        – Annotating MFs and microposts for topic generation (sl. 16)
        – Topic storyboard generation (sl. 19)
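
Slide 15 says that the EventSpotter confidence is computed from the similarity between the text surrounding a spotted entity and the event description, without naming the measure. As a rough illustration only, the sketch below uses cosine similarity over bag-of-words counts, which is an assumption and not necessarily what EventSpotter does; all function names and sample strings are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors, in [0, 1]."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with anything
    return dot / (norm_a * norm_b)


def event_confidence(surrounding_text, event_description):
    """Assumed confidence: similarity of the context to the event description."""
    return cosine_similarity(
        bag_of_words(surrounding_text), bag_of_words(event_description)
    )


if __name__ == "__main__":
    # Made-up strings, for illustration only.
    context = "Great concert tonight at the Paradiso in Amsterdam"
    description = "Concert at Paradiso, Amsterdam"
    print(round(event_confidence(context, description), 3))
```

A score near 1 means the context closely matches the event description, and a score of 0 means the two share no tokens; a real system would likely also use the agents, temporal, and geospatial information that the slide lists.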