0
NERD: an open source
platform for extracting and
disambiguating named entities
in very diverse documents
Raphaël Troncy <r...
What is a Named Entity recognition task?
 A task that aims to locate and classify the name of a
person or an organization...
Example
 “ I want to book a room in an hotel located in
the heart of Paris, just a stone’s throw from the
Eiffel Tower ”
...
Part of Speech
I
want
to
book
a
room
in
…
Paris

PRP
VBP
TO
VB
DT
NN
IN
…
NNP

NER: What is Paris?
NEL: Which Paris are we...
What is Paris? Type Ambiguity

dbpedia-owl:Asteroid

schema:City

schema:Movie
dbpedia-owl:Film

Giuseppe Rizzo, “Learning...
Named Entity Recognition (NER)
I
want
to
book
a
room
in
…
Paris

PRP
VBP
TO
VB
DT
NN
IN
…
NNP

O
O
O
O
O
O
O
…
LOC

Giusep...
What is Paris? Name Ambiguity

Paris, Kentucky

Paris, France

Paris, Maine

Paris, Idaho

Paris, Tennessee

Paris, Ontari...
Named Entity Linking (NEL)
I
want
to
book
a
room
in
…
Paris

PRP
VBP
TO
VB
DT
NN
IN
…
NNP

O
O
O
O
O
O
O
…
LOC

O
O
O
O
O
...
NER Tools and Web APIs
 Standalone software
 GATE
 Stanford CoreNLP
 Temis

http://nerd.eurecom.fr/

 Web APIs

22/10...
NERD: Named Entity Recognition and
Disambiguation
 Compare performances of
NER and NEL tools
 Understand strengths and w...
What is NERD?
ontology1

REST API2
UI3

1

http://nerd.eurecom.fr/ontology
2 http://nerd.eurecom.fr/api/application.wadl
3...
Factual comparison of 10 Web NER tools
Alchemy
API

DBpedia
Spotlight

Evri

Extractiv

Lupedia

Open
Calais

Saplo

Wikim...
NERD Ontology

Aligned the taxonomies used by
the extractors
22/10/2013 -

NLP&DBpedia International Workshop, Sydney, Oct...
NERD type

Building the NERD Ontology

Occurrence

Person

10

Organization

10

Country
Company

6

Continent

5

City

5...
NERD REST API
RDF
/document
/user
/annotation/{extractor}
/extraction
/evaluation
...

GET,
POST,
PUT,
DELETE

JSON
“entit...
NERD meets NIF
Model documents through a
set of strings deferencable on
the Web
: offset_23107_ 23110 a str:String ;
str:r...
NERD User Dashboard

22/10/2013 -

NLP&DBpedia International Workshop, Sydney, October 2013

- 17
NERD User Interface

22/10/2013 -

NLP&DBpedia International Workshop, Sydney, October 2013

- 18
History of NER benchmarks
 CoNLL 2003 and CoNLL 2005
 schema (4 types): person, organization, location and miscellaneous...
ETAPE 2012 challenge
genre

train

dev

test

TV news

7h 40m

1h 40m

1h 40m

BFM Story, Top QUestions (LCP)

TV debates
...
NERD @ ETAPE (naïve combined strategy)
extraction

(eA1,tA1,URIA1,siA1,eiA1) ...
(eA2,tA2,URIA2,siA2,eiA2)
(eA3,tA3,URIA3,...
Participation at ETAPE (combined+ strategy)
ETAPE
Train & Dev

...
Learned model

POS tagger

Created
static rules

Apply ...
NERD Global results

SLR

Precision

Recall

F-measure

%correct

combined

86.85%

35.31%

17.69%

23.44%

17.69%

combin...
Per-extractor results
SLR

Precision

Recall

F-measure

%correct

alchemyapi

37.71%

47.95%

5.45%

9.68%

5.45%

lupedi...
22/10/2013 -

NLP&DBpedia International Workshop, Sydney, October 2013

- 25
Learning How to Combine NER Extractors

22/10/2013 -

NLP&DBpedia International Workshop, Sydney, October 2013

- 26
NERD on CoNLL 2003 (NER task)

22/10/2013 -

NLP&DBpedia International Workshop, Sydney, October 2013

- 27
NERD on MSM 2013 (NER task)

22/10/2013 -

NLP&DBpedia International Workshop, Sydney, October 2013

- 28
NERD on MSM 2013 (NEL task)

22/10/2013 -

NLP&DBpedia International Workshop, Sydney, October 2013

- 29
Media Fragment Enricher:
http://mfe.synote.org/mfe/

22/10/2013 -

NLP&DBpedia International Workshop, Sydney, October 201...
Linking pieces of knowledge

22/10/2013 -

NLP&DBpedia International Workshop, Sydney, October 2013

- 31
Linking pieces of knowledge

22/10/2013 -

NLP&DBpedia International Workshop, Sydney, October 2013

- 32
Named Entities for Video Classification

22/10/2013 -

NLP&DBpedia International Workshop, Sydney, October 2013

- 33
Workflow
5:Timed Text
6: NEs with time
alignment
(json)

2: Metadata
7: RDFize (ttl)
Media Fragment Enricher Services
Meta...
Channel signature based on NE distribution

22/10/2013 -

NLP&DBpedia International Workshop, Sydney, October 2013

- 35
22/10/2013 -

NLP&DBpedia International Workshop, Sydney, October 2013

- 36
LinkedTV: automatic annotations ...

22/10/2013 -

NLP&DBpedia International Workshop, Sydney, October 2013

- 37
... and enrichment for hypervideos

CONCEPT IN
PLAYER
Cubism

Expressionism

Fauvism

FACETS / PROPERTIES OF CONCEPT
22/10...
Media Fragments and Annotations

http://data.linkedtv.eu/medi
a/e2899e7f#t=840,900

nerd:Location
Casablanca

nerd:Locatio...
Enrichment and Hypervideos

nerd:Location
Casablanca

nerd:Location
Cafe Rick

nerd:Person
H. Bogart

Nerd:Person
E. Tiern...
Media Fragment + Open Annotation + NERD
Locator

MediaResource
OffsetBasedString

Annotation

MediaFragment

Entity
Type

...
Towards a Linked Media Layer
 Enriching media with media from a closed collection
(e.g. BBC archive)
 The MediaEval scen...
Seed video enriched with web content
rbbaktuell_20120809

nerd:Location
Brandenburg
oa
Enrichments are Annotations too

22/10/2013 -

NLP&DBpedia International Workshop, Sydney, October 2013

- 44
Media Finder (named entities clustering)

22/10/2013 -

NLP&DBpedia International Workshop, Sydney, October 2013

- 45
Media Finder (zooming in a cluster)

22/10/2013 -

NLP&DBpedia International Workshop, Sydney, October 2013

- 46
Media Finder: http://mediafinder.eurecom.fr/
 Live Topic Generation from Event Streams
 WWW 2013 Demo Session
 http://w...
Credits
 Giuseppe Rizzo, Vuk Milicic,
José Luis Redondo Garcia (EURECOM)
 Thomas Steiner (Google Inc.)
 Marieke van Erp...
http://www.slideshare.net/troncy
22/10/2013 -

NLP&DBpedia International Workshop, Sydney, October 2013

- 49
Upcoming SlideShare
Loading in...5
×

NERD: an open source platform for extracting and disambiguating named entities in very diverse documents NLP-DBpedia 2013

2,472

Published on

"NERD: an open source platform for extracting and disambiguating named entities in very diverse documents" - Keynote Talk given at the NLP&DBpedia International Workshop (NLP&DBpedia), 22 October 2013

Published in: Technology, Education
2 Comments
8 Likes
Statistics
Notes
  • This is our research work, see also our publication at LDOW 2012, http://www.eurecom.fr/~troncy/Publications/Rizzo_Troncy-ldow12.pdf Those numbers were true at that time. I should indeed have put this as a disclaimer, since APIs are constantly evolving and an update is probably necessary.
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Hey Rafael, where did you get the numbers of slide 12?
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
No Downloads
Views
Total Views
2,472
On Slideshare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
42
Comments
2
Likes
8
Embeds 0
No embeds

No notes for slide

Transcript of "NERD: an open source platform for extracting and disambiguating named entities in very diverse documents NLP-DBpedia 2013"

  1. 1. NERD: an open source platform for extracting and disambiguating named entities in very diverse documents Raphaël Troncy <raphael.troncy@eurecom.fr> Giuseppe Rizzo <giuseppe.rizzo@eurecom.fr>
  2. 2. What is a Named Entity recognition task?  A task that aims to locate and classify the name of a person or an organization, a location, a brand, a product, a numeric expression including time, date, money and percent in a textual document 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 -2
  3. 3. Example  “ I want to book a room in an hotel located in the heart of Paris, just a stone’s throw from the Eiffel Tower ” Eric Charton, “Named Entity Detection and Entity Linking in the Context of Semantic Web: Exploring the ambiguity question” 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 -3
  4. 4. Part of Speech I want to book a room in … Paris PRP VBP TO VB DT NN IN … NNP NER: What is Paris? NEL: Which Paris are we talking about? Giuseppe Rizzo, “Learning with the Web: Structuring data to ease machine understanding” 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 -4
  5. 5. What is Paris? Type Ambiguity dbpedia-owl:Asteroid schema:City schema:Movie dbpedia-owl:Film Giuseppe Rizzo, “Learning with the Web: Structuring data to ease machine understanding” 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 -5
  6. 6. Named Entity Recognition (NER) I want to book a room in … Paris PRP VBP TO VB DT NN IN … NNP O O O O O O O … LOC Giuseppe Rizzo, “Learning with the Web: Structuring data to ease machine understanding” 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 -6
  7. 7. What is Paris? Name Ambiguity Paris, Kentucky Paris, France Paris, Maine Paris, Idaho Paris, Tennessee Paris, Ontario Giuseppe Rizzo, “Learning with the Web: Structuring data to ease machine understanding” 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 -7
  8. 8. Named Entity Linking (NEL) I want to book a room in … Paris PRP VBP TO VB DT NN IN … NNP O O O O O O O … LOC O O O O O O O … http://dbpedia.org/resource/Paris Giuseppe Rizzo, “Learning with the Web: Structuring data to ease machine understanding” 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 -8
  9. 9. NER Tools and Web APIs  Standalone software  GATE  Stanford CoreNLP  Temis http://nerd.eurecom.fr/  Web APIs 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 -9
  10. 10. NERD: Named Entity Recognition and Disambiguation  Compare performances of NER and NEL tools  Understand strengths and weaknesses of different Web APIs  Adapt NER processing to different context  (Learn how to) Combine NER (/ NEL) tools  Participate in various benchmarks 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 10
  11. 11. What is NERD? ontology1 REST API2 UI3 1 http://nerd.eurecom.fr/ontology 2 http://nerd.eurecom.fr/api/application.wadl 3 http://nerd.eurecom.fr 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 11
  12. 12. Factual comparison of 10 Web NER tools Alchemy API DBpedia Spotlight Evri Extractiv Lupedia Open Calais Saplo Wikimeta Yahoo! Zemanta Language EN,FR, GR,IT, PT,RU, SP,SW EN GR* PT* SP* EN,I T EN EN,FR, IT EN,FR SP EN, SW EN,FR SP EN EN Granularity OEN OEN OED OEN OEN OEN OED OEN OEN OED Entity position N/A char offset N/A word offset range of chars char offset N/A POS offset range of chars N/A Alchemy DBpedia FreeBase Scema.or g Evri DBpedia DBpedia LinkedM DB Open Calais N/A ESTER Yahoo FreeBase Number of classes 324 320 5 34 319 95 5 7 13 81 Response Format JSON MicroF XML RDF HTML JSON RDF XML HTM L JSO N RDF HTML JSON RDF XML HTML JSON RDFa XML JSON MicroF ormat JSON JSON XML JSON XML XML JSON RDF Quota (calls/day) 30000 unl 300 3000 unl 50000 NLP&DBpedia International Workshop, Sydney, October 2013 0 1333 unl 5000 10000 Classification schema 22/10/2013 - 12/15
  13. 13. NERD Ontology Aligned the taxonomies used by the extractors 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 13
  14. 14. NERD type Building the NERD Ontology Occurrence Person 10 Organization 10 Country Company 6 Continent 5 City 5 RadioStation 5 Album 5 Product 5 ... NLP&DBpedia International Workshop, Sydney, October 2013 6 Location 22/10/2013 - 6 ... - 14
  15. 15. NERD REST API RDF /document /user /annotation/{extractor} /extraction /evaluation ... GET, POST, PUT, DELETE JSON “entities” : [{ “entity”: “Tim Berners-Lee” , “type”: “Person” , “uri”: "http://dbpedia.org/resource/Tim_berners_lee", “nerdType”: "http://nerd.eurecom.fr/ontology#Person", “startChar”: 30, “endChar”: 45, “confidence”: 1, “relevance”: 0.5 }] Rizzo G., Troncy R. (2012), NERD: A Framework for Unifying Named Entity Recognition and Disambiguation Web Extraction Tools. In: European chapter of the Association for Computational Linguistics (EACL'12), Avignon, France. 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 15
  16. 16. NERD meets NIF Model documents through a set of strings deferencable on the Web : offset_23107_ 23110 a str:String ; str:referenceContext :offset_0_26546 . Map string to entity : offset_23107_ 23110 sso:oen dbpedia:W3C. Classification dbpedia:W3C rdf:type nerd:Organization . Rizzo G, Troncy R., Hellmann S. and Bruemmer M. (2012), NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud. In: (LDOW'12) Linked Data on the Web (WWW'12), Lyon, France. 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 16
  17. 17. NERD User Dashboard 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 17
  18. 18. NERD User Interface 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 18
  19. 19. History of NER benchmarks  CoNLL 2003 and CoNLL 2005  schema (4 types): person, organization, location and miscellaneous  ACE 2004, ACE 2005 and ACE 2007  schema (7 types): person, organization, location, facility, weapon, vehicle and geo-political entity  entity recognition, co-ref, find relationships among entities extracted  TAC 2009 (Knowledge Base Track)  schema (3 types): person, organization and location  create a knowledge base from the named entities extracted  ETAPE 2012 (Named Entity Task)  schema: Quaero (7 main types, 32 sub-types)  MSM 2013: tweet corpus !  schema (4 types): person, organization, location, miscellaneous 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 19
  20. 20. ETAPE 2012 challenge genre train dev test TV news 7h 40m 1h 40m 1h 40m BFM Story, Top QUestions (LCP) TV debates 10h 30m 5h 10m 5h 10m Pile et Face, Ca vous regarde, Entre les lignes (LCP) 1h 05m 1h 05m La place du village (TV8) TV amusements - sources Train Dev Eval Item length 26h 10h 55m 10h 55m Nb files 44 15 15 Nb words 290517 91656 115511 Nb Named Entities 46763 14398 13055 Nb unique categories 33 33 33 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 20
  21. 21. NERD @ ETAPE (naïve combined strategy) extraction (eA1,tA1,URIA1,siA1,eiA1) ... (eA2,tA2,URIA2,siA2,eiA2) (eA3,tA3,URIA3,siA3,eiA3) ... ... cleaning fusion ` 22/10/2013 - (eN1,tN1,URIN1,siN1,eiN1) (eN2,tN2,URIN2,siN2,eiN2) When at least 2 extractors classify the same entity with a different type then we apply a preferred selection order (empirically defined): Wikimeta, AlchemyAPI, OpenCalais, Lupedia NLP&DBpedia International Workshop, Sydney, October 2013 - 21
  22. 22. Participation at ETAPE (combined+ strategy) ETAPE Train & Dev ... Learned model POS tagger Created static rules Apply rules (eA1,tA1,URIA1,siA1,eA1 ) (eA2,tA2,URIA2,siA2,eiA2 ) (e1,t1,URI1,si1,ei1) fusion Conflicts handled by priority selection: own, Wikimeta,AlchemyAPI, OpenCalais,Lupedia (eN1,tN1,URIN1,sN1,eN1) `(e ,t ,URI ,s ,e ) N2 N2 N2 N2 N2 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 22
  23. 23. NERD Global results SLR Precision Recall F-measure %correct combined 86.85% 35.31% 17.69% 23.44% 17.69% combined+ 188.81% 15.13% 28.40% 19.45% 28.40% Combined+ : Eval corpus differs substantially from the Train & Dev corpora. The static rules do not fit well the Eval corpora and they introduce classification noise. 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 23
  24. 24. Per-extractor results SLR Precision Recall F-measure %correct alchemyapi 37.71% 47.95% 5.45% 9.68% 5.45% lupedia 39.49% 22.87% 1.56% 2.91% 1.56% opencalais 37.47% 41.69% 3.53% 6.49% 3.53% wikimeta 36.67% 19.40% 4.25% 6.95% 4.25% combined (nerd) 86.85% 35.31% 17.69% 23.44% 17.69% combined+ (nerd+) 188.81% 15.13% 28.40% 19.45% 28.40% 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 24
  25. 25. 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 25
  26. 26. Learning How to Combine NER Extractors 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 26
  27. 27. NERD on CoNLL 2003 (NER task) 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 27
  28. 28. NERD on MSM 2013 (NER task) 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 28
  29. 29. NERD on MSM 2013 (NEL task) 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 29
  30. 30. Media Fragment Enricher: http://mfe.synote.org/mfe/ 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 30
  31. 31. Linking pieces of knowledge 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 31
  32. 32. Linking pieces of knowledge 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 32
  33. 33. Named Entities for Video Classification 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 33
  34. 34. Workflow 5:Timed Text 6: NEs with time alignment (json) 2: Metadata 7: RDFize (ttl) Media Fragment Enricher Services Metadata & timed-text 1: Video URL NERD Client 3: metadata RDFizator 9: SPARQL query 4:NERDify Video and metadata preview Categorization Triple Store 8: Generate Category Video replay with subtitles and aligned NEs Media Fragment Enricher UI 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 34
  35. 35. Channel signature based on NE distribution 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 35
  36. 36. 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 36
  37. 37. LinkedTV: automatic annotations ... 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 37
  38. 38. ... and enrichment for hypervideos CONCEPT IN PLAYER Cubism Expressionism Fauvism FACETS / PROPERTIES OF CONCEPT 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 CONTENT ENRICHMENT - 38
  39. 39. Media Fragments and Annotations http://data.linkedtv.eu/medi a/e2899e7f#t=840,900 nerd:Location Casablanca nerd:Location Cafe Rick nerd:Person H. Bogart nerd:Person I. Bergman  Media Fragment URI 1.0     22/10/2013 - Chapters Scenes Shots etc… NLP&DBpedia International Workshop, Sydney, October 2013 - 39
  40. 40. Enrichment and Hypervideos nerd:Location Casablanca nerd:Location Cafe Rick nerd:Person H. Bogart Nerd:Person E. Tierney 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 nerd:Person I. Bergman nerd:Location China - 40
  41. 41. Media Fragment + Open Annotation + NERD Locator MediaResource OffsetBasedString Annotation MediaFragment Entity Type URL (hyperlink) 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 41
  42. 42. Towards a Linked Media Layer  Enriching media with media from a closed collection (e.g. BBC archive)  The MediaEval scenario (~ 1697 hours of archived BBC video) http://www.multimediaeval.org/mediaeval2013/hyper2013/  Enriching media with content from the open web  LinkedTV scenarios: white listed web sites for each program  Media Collector for Social Media 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 42
  43. 43. Seed video enriched with web content rbbaktuell_20120809 nerd:Location Brandenburg oa
  44. 44. Enrichments are Annotations too 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 44
  45. 45. Media Finder (named entities clustering) 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 45
  46. 46. Media Finder (zooming in a cluster) 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 46
  47. 47. Media Finder: http://mediafinder.eurecom.fr/  Live Topic Generation from Event Streams  WWW 2013 Demo Session  http://www.youtube.com/watch?v=8iRiwz7cDYY 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 47
  48. 48. Credits  Giuseppe Rizzo, Vuk Milicic, José Luis Redondo Garcia (EURECOM)  Thomas Steiner (Google Inc.)  Marieke van Erp (Free University of Amsterdam)  Yunjia Li (University of Southampton)  … and many other students 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 48
  49. 49. http://www.slideshare.net/troncy 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 49
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×