NERD: an open source platform for extracting and disambiguating named entities in very diverse documents NLP-DBpedia 2013
Upcoming SlideShare
Loading in...5
×
 

NERD: an open source platform for extracting and disambiguating named entities in very diverse documents NLP-DBpedia 2013

on

  • 1,280 views

"NERD: an open source platform for extracting and disambiguating named entities in very diverse documents" - Keynote Talk given at the NLP&DBpedia International Workshop (NLP&DBpedia), 22 October ...

"NERD: an open source platform for extracting and disambiguating named entities in very diverse documents" - Keynote Talk given at the NLP&DBpedia International Workshop (NLP&DBpedia), 22 October 2013

Statistics

Views

Total Views
1,280
Slideshare-icon Views on SlideShare
1,279
Embed Views
1

Actions

Likes
4
Downloads
15
Comments
2

1 Embed 1

http://www.slideee.com 1

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

CC Attribution-NonCommercial-ShareAlike LicenseCC Attribution-NonCommercial-ShareAlike LicenseCC Attribution-NonCommercial-ShareAlike License

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel

12 of 2

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    NERD: an open source platform for extracting and disambiguating named entities in very diverse documents NLP-DBpedia 2013 NERD: an open source platform for extracting and disambiguating named entities in very diverse documents NLP-DBpedia 2013 Presentation Transcript

    • NERD: an open source platform for extracting and disambiguating named entities in very diverse documents Raphaël Troncy <raphael.troncy@eurecom.fr> Giuseppe Rizzo <giuseppe.rizzo@eurecom.fr>
    • What is a Named Entity recognition task?  A task that aims to locate and classify the name of a person or an organization, a location, a brand, a product, a numeric expression including time, date, money and percent in a textual document 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 -2
    • Example  “ I want to book a room in an hotel located in the heart of Paris, just a stone’s throw from the Eiffel Tower ” Eric Charton, “Named Entity Detection and Entity Linking in the Context of Semantic Web: Exploring the ambiguity question” 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 -3
    • Part of Speech I want to book a room in … Paris PRP VBP TO VB DT NN IN … NNP NER: What is Paris? NEL: Which Paris are we talking about? Giuseppe Rizzo, “Learning with the Web: Structuring data to ease machine understanding” 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 -4
    • What is Paris? Type Ambiguity dbpedia-owl:Asteroid schema:City schema:Movie dbpedia-owl:Film Giuseppe Rizzo, “Learning with the Web: Structuring data to ease machine understanding” 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 -5
    • Named Entity Recognition (NER) I want to book a room in … Paris PRP VBP TO VB DT NN IN … NNP O O O O O O O … LOC Giuseppe Rizzo, “Learning with the Web: Structuring data to ease machine understanding” 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 -6
    • What is Paris? Name Ambiguity Paris, Kentucky Paris, France Paris, Maine Paris, Idaho Paris, Tennessee Paris, Ontario Giuseppe Rizzo, “Learning with the Web: Structuring data to ease machine understanding” 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 -7
    • Named Entity Linking (NEL) I want to book a room in … Paris PRP VBP TO VB DT NN IN … NNP O O O O O O O … LOC O O O O O O O … http://dbpedia.org/resource/Paris Giuseppe Rizzo, “Learning with the Web: Structuring data to ease machine understanding” 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 -8
    • NER Tools and Web APIs  Standalone software  GATE  Stanford CoreNLP  Temis http://nerd.eurecom.fr/  Web APIs 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 -9
    • NERD: Named Entity Recognition and Disambiguation  Compare performances of NER and NEL tools  Understand strengths and weaknesses of different Web APIs  Adapt NER processing to different context  (Learn how to) Combine NER (/ NEL) tools  Participate in various benchmarks 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 10
    • What is NERD? ontology1 REST API2 UI3 1 http://nerd.eurecom.fr/ontology 2 http://nerd.eurecom.fr/api/application.wadl 3 http://nerd.eurecom.fr 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 11
    • Factual comparison of 10 Web NER tools Alchemy API DBpedia Spotlight Evri Extractiv Lupedia Open Calais Saplo Wikimeta Yahoo! Zemanta Language EN,FR, GR,IT, PT,RU, SP,SW EN GR* PT* SP* EN,I T EN EN,FR, IT EN,FR SP EN, SW EN,FR SP EN EN Granularity OEN OEN OED OEN OEN OEN OED OEN OEN OED Entity position N/A char offset N/A word offset range of chars char offset N/A POS offset range of chars N/A Alchemy DBpedia FreeBase Scema.or g Evri DBpedia DBpedia LinkedM DB Open Calais N/A ESTER Yahoo FreeBase Number of classes 324 320 5 34 319 95 5 7 13 81 Response Format JSON MicroF XML RDF HTML JSON RDF XML HTM L JSO N RDF HTML JSON RDF XML HTML JSON RDFa XML JSON MicroF ormat JSON JSON XML JSON XML XML JSON RDF Quota (calls/day) 30000 unl 300 3000 unl 50000 NLP&DBpedia International Workshop, Sydney, October 2013 0 1333 unl 5000 10000 Classification schema 22/10/2013 - 12/15
    • NERD Ontology Aligned the taxonomies used by the extractors 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 13
    • NERD type Building the NERD Ontology Occurrence Person 10 Organization 10 Country Company 6 Continent 5 City 5 RadioStation 5 Album 5 Product 5 ... NLP&DBpedia International Workshop, Sydney, October 2013 6 Location 22/10/2013 - 6 ... - 14
    • NERD REST API RDF /document /user /annotation/{extractor} /extraction /evaluation ... GET, POST, PUT, DELETE JSON “entities” : [{ “entity”: “Tim Berners-Lee” , “type”: “Person” , “uri”: "http://dbpedia.org/resource/Tim_berners_lee", “nerdType”: "http://nerd.eurecom.fr/ontology#Person", “startChar”: 30, “endChar”: 45, “confidence”: 1, “relevance”: 0.5 }] Rizzo G., Troncy R. (2012), NERD: A Framework for Unifying Named Entity Recognition and Disambiguation Web Extraction Tools. In: European chapter of the Association for Computational Linguistics (EACL'12), Avignon, France. 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 15
    • NERD meets NIF Model documents through a set of strings deferencable on the Web : offset_23107_ 23110 a str:String ; str:referenceContext :offset_0_26546 . Map string to entity : offset_23107_ 23110 sso:oen dbpedia:W3C. Classification dbpedia:W3C rdf:type nerd:Organization . Rizzo G, Troncy R., Hellmann S. and Bruemmer M. (2012), NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud. In: (LDOW'12) Linked Data on the Web (WWW'12), Lyon, France. 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 16
    • NERD User Dashboard 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 17
    • NERD User Interface 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 18
    • History of NER benchmarks  CoNLL 2003 and CoNLL 2005  schema (4 types): person, organization, location and miscellaneous  ACE 2004, ACE 2005 and ACE 2007  schema (7 types): person, organization, location, facility, weapon, vehicle and geo-political entity  entity recognition, co-ref, find relationships among entities extracted  TAC 2009 (Knowledge Base Track)  schema (3 types): person, organization and location  create a knowledge base from the named entities extracted  ETAPE 2012 (Named Entity Task)  schema: Quaero (7 main types, 32 sub-types)  MSM 2013: tweet corpus !  schema (4 types): person, organization, location, miscellaneous 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 19
    • ETAPE 2012 challenge genre train dev test TV news 7h 40m 1h 40m 1h 40m BFM Story, Top QUestions (LCP) TV debates 10h 30m 5h 10m 5h 10m Pile et Face, Ca vous regarde, Entre les lignes (LCP) 1h 05m 1h 05m La place du village (TV8) TV amusements - sources Train Dev Eval Item length 26h 10h 55m 10h 55m Nb files 44 15 15 Nb words 290517 91656 115511 Nb Named Entities 46763 14398 13055 Nb unique categories 33 33 33 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 20
    • NERD @ ETAPE (naïve combined strategy) extraction (eA1,tA1,URIA1,siA1,eiA1) ... (eA2,tA2,URIA2,siA2,eiA2) (eA3,tA3,URIA3,siA3,eiA3) ... ... cleaning fusion ` 22/10/2013 - (eN1,tN1,URIN1,siN1,eiN1) (eN2,tN2,URIN2,siN2,eiN2) When at least 2 extractors classify the same entity with a different type then we apply a preferred selection order (empirically defined): Wikimeta, AlchemyAPI, OpenCalais, Lupedia NLP&DBpedia International Workshop, Sydney, October 2013 - 21
    • Participation at ETAPE (combined+ strategy) ETAPE Train & Dev ... Learned model POS tagger Created static rules Apply rules (eA1,tA1,URIA1,siA1,eA1 ) (eA2,tA2,URIA2,siA2,eiA2 ) (e1,t1,URI1,si1,ei1) fusion Conflicts handled by priority selection: own, Wikimeta,AlchemyAPI, OpenCalais,Lupedia (eN1,tN1,URIN1,sN1,eN1) `(e ,t ,URI ,s ,e ) N2 N2 N2 N2 N2 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 22
    • NERD Global results SLR Precision Recall F-measure %correct combined 86.85% 35.31% 17.69% 23.44% 17.69% combined+ 188.81% 15.13% 28.40% 19.45% 28.40% Combined+ : Eval corpus differs substantially from the Train & Dev corpora. The static rules do not fit well the Eval corpora and they introduce classification noise. 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 23
    • Per-extractor results SLR Precision Recall F-measure %correct alchemyapi 37.71% 47.95% 5.45% 9.68% 5.45% lupedia 39.49% 22.87% 1.56% 2.91% 1.56% opencalais 37.47% 41.69% 3.53% 6.49% 3.53% wikimeta 36.67% 19.40% 4.25% 6.95% 4.25% combined (nerd) 86.85% 35.31% 17.69% 23.44% 17.69% combined+ (nerd+) 188.81% 15.13% 28.40% 19.45% 28.40% 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 24
    • 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 25
    • Learning How to Combine NER Extractors 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 26
    • NERD on CoNLL 2003 (NER task) 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 27
    • NERD on MSM 2013 (NER task) 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 28
    • NERD on MSM 2013 (NEL task) 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 29
    • Media Fragment Enricher: http://mfe.synote.org/mfe/ 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 30
    • Linking pieces of knowledge 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 31
    • Linking pieces of knowledge 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 32
    • Named Entities for Video Classification 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 33
    • Workflow 5:Timed Text 6: NEs with time alignment (json) 2: Metadata 7: RDFize (ttl) Media Fragment Enricher Services Metadata & timed-text 1: Video URL NERD Client 3: metadata RDFizator 9: SPARQL query 4:NERDify Video and metadata preview Categorization Triple Store 8: Generate Category Video replay with subtitles and aligned NEs Media Fragment Enricher UI 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 34
    • Channel signature based on NE distribution 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 35
    • 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 36
    • LinkedTV: automatic annotations ... 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 37
    • ... and enrichment for hypervideos CONCEPT IN PLAYER Cubism Expressionism Fauvism FACETS / PROPERTIES OF CONCEPT 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 CONTENT ENRICHMENT - 38
    • Media Fragments and Annotations http://data.linkedtv.eu/medi a/e2899e7f#t=840,900 nerd:Location Casablanca nerd:Location Cafe Rick nerd:Person H. Bogart nerd:Person I. Bergman  Media Fragment URI 1.0     22/10/2013 - Chapters Scenes Shots etc… NLP&DBpedia International Workshop, Sydney, October 2013 - 39
    • Enrichment and Hypervideos nerd:Location Casablanca nerd:Location Cafe Rick nerd:Person H. Bogart Nerd:Person E. Tierney 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 nerd:Person I. Bergman nerd:Location China - 40
    • Media Fragment + Open Annotation + NERD Locator MediaResource OffsetBasedString Annotation MediaFragment Entity Type URL (hyperlink) 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 41
    • Towards a Linked Media Layer  Enriching media with media from a closed collection (e.g. BBC archive)  The MediaEval scenario (~ 1697 hours of archived BBC video) http://www.multimediaeval.org/mediaeval2013/hyper2013/  Enriching media with content from the open web  LinkedTV scenarios: white listed web sites for each program  Media Collector for Social Media 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 42
    • Seed video enriched with web content rbbaktuell_20120809 nerd:Location Brandenburg oa
    • Enrichments are Annotations too 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 44
    • Media Finder (named entities clustering) 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 45
    • Media Finder (zooming in a cluster) 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 46
    • Media Finder: http://mediafinder.eurecom.fr/  Live Topic Generation from Event Streams  WWW 2013 Demo Session  http://www.youtube.com/watch?v=8iRiwz7cDYY 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 47
    • Credits  Giuseppe Rizzo, Vuk Milicic, José Luis Redondo Garcia (EURECOM)  Thomas Steiner (Google Inc.)  Marieke van Erp (Free University of Amsterdam)  Yunjia Li (University of Southampton)  … and many other students 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 48
    • http://www.slideshare.net/troncy 22/10/2013 - NLP&DBpedia International Workshop, Sydney, October 2013 - 49