Pablo Mendes' Defense: Adaptive Semantic Annotation of Entity and Concept Mentions in Text
PhD defense held at Kno.e.sis Center, Wright State University, December 03, 2013.

Transcript

  • 1. Adaptive Semantic Annotation of Entity and Concept Mentions in Text. Pablo N. Mendes, PhD dissertation defense. Ohio Center of Excellence in Knowledge-enabled Computing (kno.e.sis), Wright State University, Dayton, OH.
  • 2. Introductions and Thank you!
  • 3. Outline ● Introduction, Motivation, Background – KB Tagging, Annotation as a Service ● Conceptual Model ● Knowledge Base: DBpedia ● System: DBpedia Spotlight ● Core Evaluations ● Case Studies – tweets, audio transcripts, educational material
  • 4. Outline ● Introduction, Motivation, Background – KBT: Knowledge Base Tagging of Text – AaaS: Annotation as a Service – Adaptability ● Conceptual Model ● Knowledge Base: DBpedia ● System: DBpedia Spotlight ● Core Evaluations ● Case Studies
  • 5. KBT, informally ● Knowledge Base Tagging (KBT) ● A developer needs to – “extract entities”, – “identify what is mentioned”, – “connect to knowledge bases”. ● He/she is not an NLP or IE expert ● Would like to reuse as much as possible ● May have limited computational resources → Annotation as a Service (AaaS) 5
  • 6. Task examples over a single passage: “On Thursday, April 11, 1996, a fire in an occupied passenger terminal at the airport in Düsseldorf, Germany, killed 17 people and injured 62. The fire began at approximately 3:31 p.m., about the time someone reported seeing sparks falling from the ceiling in the vicinity of a flower shop at the east end of the arrivals hall on the first floor.” Named Entity Recognition (NER) marks typed spans such as LOCATION, DATE and TIME. Keyphrase Extraction (KE) selects fire, airport, Düsseldorf, Germany. Automatic Term Recognition (ATR) extracts fire, passenger terminal, sparks, arrivals hall, ceiling. Wikification (WKF) links mentions to Wikipedia pages. Entity Linking (EL) resolves Düsseldorf to a KB entry (LOCATION, ID:4213421).
  • 7. Related Work: systems such as SemTag, Semantic Voquette, SCORE, Semagix Freedom, AIDA/YAGO, Illinois Wikifier, TagMe, and my work, positioned along task dimensions (NER, ATR, KE, Wikification) and along knowledge-source dimensions ranging from syntactic and domain-specific (Web content, auto-extracted facts) to community-generated, cross-domain and multilingual.
  • 8. Related Work (commercial)
  • 9. Adaptability ● Each developer may have a different application in mind, with different inputs (news, scientific literature, tweets, audio transcripts, query keywords) and different outputs (new terms, important phrases, named entities, concepts related to an objective) ● “get key topics for summarization?” “exhaustive tagging for semantic search?” ● There is no one-size-fits-all, but can we support adaptation to different “fits”?
  • 10. Requirements ● Transparent process: clear understanding of where things are working or failing ● Adaptable process: ability to exchange individual components in order to achieve different goals, and to modify the behavior of existing components ● Adaptable to different inputs
  • 11. Outline ● Introduction, Motivation, Background ● Conceptual Model ● Knowledge Base: DBpedia ● System: DBpedia Spotlight ● Core Evaluations ● Case Studies ● Conclusion
  • 12. A Conceptual Model of KBT: a User (Creator) produces text (e.g. the Düsseldorf airport fire passage); the System, backed by a KB, runs Phrase Recognition, Candidate Selection, Disambiguation and Tagging, guided by an Objective and by feedback from an Editor, and delivers Annotations to a User (Consumer).
  • 13. KBT and Related Tasks: the example passage, now annotated with typed, scored links (e.g. Düsseldorf → LOCATION at 0.87, sparks → Spark_(fire)), frames a comparison table. Extraction tasks (ATR, KE, NER, WSD, WKF, EL, KBT) are contrasted along whether they recognize known terms, recognize new terms (NIL), classify ontological type, resolve ambiguity, measure importance/relevance (to the domain or to the text), and tag each occurrence.
  • 14. Novelty in the model ● Users and objective are explicit in the model – Knowledge about content creators provides context for new types of KBT – Knowledge about consumer and objective for customizing output – Using feedback to learn from mistakes
  • 15. Outline ● Introduction, Motivation, Background ● Conceptual Model ● Knowledge Base: DBpedia ● System: DBpedia Spotlight ● Core Evaluations ● Case Studies
  • 16. Wikipedia Extraction
  • 17. Knowledge Base ● DBpedia is a cross-domain KB extracted from Wikipedia [Auer et al. 2007, Bizer et al. 2009] ● Describes 3.7M things through 400M facts ● Uses an ontology of 320 classes and 1,650 properties [Lehmann et al. 2013] ● DBpedia Live keeps DBpedia up to date with Wikipedia changes [Hellmann et al. 2009; Morsey et al. 2012] ● A whole ecosystem with an active community
  • 18. DBpedia Extraction Framework Added new extractors to support KBT: - Thematic Concepts - Topical signatures - Distributional Semantic Model statistics for semantic relatedness [with Lehmann et al. @ SWJ 2013] 18
  • 19. Outline ● Introduction, Motivation, Background ● Conceptual Model ● Knowledge Base: DBpedia ● System: DBpedia Spotlight ● Core Evaluations ● Case Studies ● Conclusion
  • 20. System: default workflow ● Phrase Recognition: mention recognition (e.g. NER) ● Candidate Selection: detecting possible senses for a surface form ● Disambiguation: choosing (ranking/classifying) one sense for a mention ● Tagging: deciding whether to annotate, to account for entities not in the KB or uninformative annotations
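The workflow above is designed so that each stage can be exchanged or reconfigured. The following is a minimal, illustrative sketch of that modular structure in Python; the actual DBpedia Spotlight implementation is in Scala/Java, and all class and function names here are hypothetical.

```python
# Illustrative sketch of the four-stage KBT workflow as swappable components.
from dataclasses import dataclass
from typing import List, Protocol, Tuple


@dataclass
class Mention:
    surface_form: str   # e.g. "New York"
    offset: int         # character position in the input text


class PhraseRecognizer(Protocol):
    def spot(self, text: str) -> List[Mention]: ...


class CandidateSelector(Protocol):
    def candidates(self, mention: Mention) -> List[str]: ...  # candidate KB URIs


class Disambiguator(Protocol):
    def best(self, mention: Mention, candidates: List[str], context: str) -> Tuple[str, float]: ...


class Tagger(Protocol):
    def keep(self, mention: Mention, uri: str, score: float) -> bool: ...


def annotate(text, recognizer, selector, disambiguator, tagger):
    """Default workflow; any stage can be exchanged to adapt the system to a use case."""
    annotations = []
    for mention in recognizer.spot(text):
        cands = selector.candidates(mention)
        if not cands:
            continue  # possibly a NIL / out-of-KB mention
        uri, score = disambiguator.best(mention, cands, text)
        if tagger.keep(mention, uri, score):
            annotations.append((mention, uri, score))
    return annotations
```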
  • 21. Worked example: “(…) Upon their return, Lennon and McCartney went to New York to announce the formation of Apple Corps.” Phrase Recognition spots “New York”; Candidate Selection retrieves New York (magazine), New York, Manhattan, Province of New York, New York City, New York, New York (film), New York metropolitan area, West New York, New Jersey, Roman Catholic Archdiocese of New York, and Pennsylvania Station (New York City); contextual relatedness scores them (0.10, 0.34, 0.22, 0.23, 0.67, 0.45, 0.56, 0.01, 0.33, 0.07); Disambiguation picks New York City; Tagging emits the annotation “New York” with type: city, pos: 78, relevance: 0.67, ...
  • 22. A quick example
  • 23. Show Top-K Candidates: LSU_Tigers, Louisiana State University
  • 24. Virtuous Cycle ● Through the Sztakipedia toolbar, DBpedia Spotlight suggests links to Wikipedia editors and catalyzes evolution of the knowledge source ● The /feedback service allows users to submit judgements and enables system evolution with feedback; also on blogs, etc., with RDFaCE [Khalili] ● [with Héder @ WWW'2012]
  • 25. Contextual relatedness score: TF*ICF [Mendes et al. @ ISEM2011] ● Compare with TF*IDF (Term Frequency * Inverse Document Frequency): TF captures the relevance of a word in the context of a DBpedia resource; IDF says that words that are too common are less useful ● ICF (Inverse Candidate Frequency, entropy-inspired) instead measures the rarity of a word with respect to the possible senses ● Example: for the surface form “Washington”, the candidates Washington, D.C. (context {“capital”, ”USA”, ...}), George Washington ({“president”, ”USA”, ...}) and Washington State ({“Seattle”, ”USA”, ...}) all share “USA”, so ICF(“Washington”, ”USA”) < ICF(“Washington”, ”Seattle”)
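A hedged sketch of the ICF idea, assuming each candidate resource is represented by a bag of context words gathered from Wikipedia paragraphs that link to it. The real system builds a TF*ICF-weighted vector space model and ranks candidates by cosine similarity; the simplified additive scoring below is only meant to show how ICF down-weights words shared by many candidate senses. Contexts and numbers are toy data.

```python
# Hedged sketch of ICF (Inverse Candidate Frequency) scoring with toy data.
import math
from collections import Counter


def icf(word, candidate_contexts):
    """The rarer a word is among the candidate senses, the more it discriminates."""
    n_candidates = len(candidate_contexts)
    n_with_word = sum(1 for ctx in candidate_contexts.values() if word in ctx)
    return math.log(n_candidates / n_with_word) if n_with_word else 0.0


def tficf_scores(mention_context, candidate_contexts):
    """Score each candidate by the sum of TF(word, candidate) * ICF(word) over the mention context."""
    scores = {}
    for uri, ctx_words in candidate_contexts.items():
        tf = Counter(ctx_words)
        scores[uri] = sum(tf[w] * icf(w, candidate_contexts) for w in mention_context)
    return scores


# Toy illustration for the surface form "Washington":
candidates = {
    "dbpedia:Washington,_D.C.":   ["capital", "usa", "congress", "capital"],
    "dbpedia:George_Washington":  ["president", "usa", "general", "revolution"],
    "dbpedia:Washington_(state)": ["seattle", "usa", "pacific", "state"],
}
print(tficf_scores(["president", "usa"], candidates))
# "usa" occurs in every candidate context, so ICF("usa") = log(3/3) = 0 and it
# contributes nothing; "president" occurs in only one context, so it decides the ranking.
```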
  • 26. Outline ● Introduction, Motivation, Background ● Conceptual Model ● Knowledge Base ● System: DBpedia Spotlight ● Core Evaluations ● Case Studies ● Conclusion
  • 27. Core Evaluations: each stage of the workflow is evaluated separately, using the same pipeline illustration as slide 21 (Phrase Recognition, Candidate Selection with contextual relatedness, Disambiguation, Tagging over the “New York” example).
  • 28. Phrase Recognition Results ● Policies keep the spots S = { s | p(s) > cutoff_S }: (L) Lexicon-based; (LNP*) Lexicon-based with at least one noun; (NPL) Noun phrases with lexicon lookup (Bloom filter); (CW) Lexicon-based removing common words; (Kea) Keyphrases; (NER) Named entities only; (NER ∪ NP) N-grams within noun phrases and NEs ● Take-home, from comparing spotting strategies on the CSAW dataset (at LREC'2012): it is not only about importance/relevance; precision is less critical because it is taken care of in steps downstream; recall is key, since a phrase missed at this stage is an overall failure; simple methods work quite well; combinations of techniques improve results.
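As an illustration of the cutoff policy, here is a small sketch assuming p(s) is an annotation-probability estimate, i.e. how often the phrase s appears as a link relative to all its occurrences in Wikipedia; this interpretation of p(s), the counts and the names below are assumptions for illustration only.

```python
# Sketch of the cutoff policy S = { s | p(s) > cutoff_S }, assuming p(s) is
# (times s was linked) / (times s occurred) in Wikipedia. Hypothetical counts.
def keep_spots(spot_counts, cutoff=0.1):
    """spot_counts maps a surface form to (times_linked, times_seen); return kept spots."""
    kept = set()
    for surface_form, (times_linked, times_seen) in spot_counts.items():
        p = times_linked / times_seen if times_seen else 0.0
        if p > cutoff:
            kept.add(surface_form)
    return kept


print(keep_spots({"new york": (8000, 10000), "the": (5, 1000000)}))
# -> {'new york'}: very common words that are almost never linked fall below the cutoff
```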
  • 29. Context-Independent Strategies ● NAÏVE: use the surface form to build the URI, “berlin” → dbpedia:Berlin ● PROMINENCE: P(u) = n(u) / N, the ‘popularity’/importance of a URI, where n(u) is the number of times URI u occurred and N is the total number of occurrences; intuition: URIs that have appeared a lot are more likely to appear again ● DEFAULT SENSE: P(u|s) = n(u,s) / n(s), where n(u,s) is the number of times URI u occurred with surface form s; intuition: some surface forms are strongly associated with some specific URIs
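A small sketch of these two baselines, assuming the counts are harvested from Wikipedia wikilinks; the numbers below are hypothetical and only show the direction of the estimates.

```python
# Sketch of the context-independent baselines estimated from wikilink counts.
from collections import Counter


def prominence(uri, uri_counts):
    """P(u) = n(u) / N: how often a resource is linked, regardless of the surface form."""
    return uri_counts[uri] / sum(uri_counts.values())


def default_sense(uri, surface_form, pair_counts, sf_counts):
    """P(u|s) = n(u, s) / n(s): how often this surface form links to this particular resource."""
    return pair_counts[(uri, surface_form)] / sf_counts[surface_form]


uri_counts = Counter({"dbpedia:Berlin": 900, "dbpedia:Berlin,_New_Hampshire": 30})
pair_counts = Counter({("dbpedia:Berlin", "berlin"): 870,
                       ("dbpedia:Berlin,_New_Hampshire", "berlin"): 25})
sf_counts = Counter({"berlin": 895})

print(prominence("dbpedia:Berlin", uri_counts))                           # ~0.97
print(default_sense("dbpedia:Berlin", "berlin", pair_counts, sf_counts))  # ~0.97
```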
  • 30. Disambiguation ● Preliminary results with 155,000 randomly selected wikilink samples ● Balanced between common and less prominent concepts (default sense: 55.12%) ● Highly ambiguous (random: 17.77%) ● At I-Semantics 2011.
  • 31. Disambiguation + NIL ● Named Entities Only, TAC KBP 2010: DefaultSense: 79.91%, Random: 62.00%, Unambiguous: 30.36% ● NIL accuracy = 79.27%, non-NIL accuracy = 87.88%, overall accuracy = 82.71% ● At TAC KBP 2011.
  • 32. Disambiguation Difficulty ● Geopolitical entities KB: 830K entities ● 311 blog posts, 790 annotations 32
  • 33. Geolocation Disambiguation Evaluation results ● Validate our measure of “difficulty” (performance degrades as difficulty increases) ● Show that our system is more robust for disambiguating low-dominance entities
  • 34. Dominance Analysis 34
  • 35. Dominance Analysis 35
  • 36. Tagging • • Decide which spots to annotate with links to the disambiguated resources Different use cases have different needs – Only annotate prominent resources? – Only if you’re sure disambiguation is correct? – Only people? – Only things related to Berlin? 36
  • 37. Tagging in DBpedia Spotlight • Tagging needs are application/user-specific • Can be configured based on: – Thresholds: confidence, prominence (support) – Whitelist or blacklist of types: hide all people, show only organizations – Complex definition of a “type” through a SPARQL query
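To make the configuration concrete, here is a hedged sketch of such settings. The parameter names are modeled on the public DBpedia Spotlight interface but should be treated as assumptions (check the current documentation), and the SPARQL “complex type” is an illustrative query over the DBpedia ontology, not a query taken from the dissertation.

```python
# Hedged sketch of tagging configuration; parameter names are assumptions
# modeled on the public DBpedia Spotlight interface.
tagging_config = {
    "confidence": 0.35,               # disambiguation confidence threshold
    "support": 50,                    # prominence threshold (number of inlinks)
    "types": "DBpedia:Organisation",  # simple type filter ...
    "policy": "whitelist",            # ... applied as a whitelist (or blacklist)
}

# A "complex type" expressed as a SPARQL query over the KB, e.g. keep only
# people born in Berlin (classes and properties from the DBpedia ontology).
COMPLEX_TYPE = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT DISTINCT ?resource WHERE {
  ?resource a dbo:Person ;
            dbo:birthPlace dbr:Berlin .
}
"""
```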
  • 38. Tagging Evaluation (News) ● Preliminary results: able to approximate both the best precision and the best recall ● Varying the parameters makes it possible to cover a wide range of the precision/recall trade-off
  • 39. Tagging (take-home) ● Combines features from spotting, candidate selection and disambiguation ● Is better informed to make decisions ● Can avoid or fix some mistakes from previous steps ● Offers a chance to adapt to users' needs
  • 40. Outline ● Introduction, Motivation, Background ● Conceptual Model ● Knowledge Base ● System: DBpedia Spotlight ● Core Evaluations ● Case Studies ● Conclusion
  • 41. Case Study: Audio Tagging ● http://www.bbc.co.uk/programmes
  • 42. Example: Audio Transcript ● BBC Audio Archive tag suggestion May 1945 German capital whirlpool or not the b. b. c. witnessed when the jam and capital but then fell to the advancing bad timing in maine nineteen forty five the civilians living there feared to sing and violence steve athens hughes from one woman custom finds that tying it sit beginning of may nineteen forty five but then is being squeezed between the british americans from the west in the russian army from the east but sides fighting for every inch of land and forgets to this city is being pulverized ... ● Tags: Berlin, World War II, Russian Army, etc. Raimond & Lowis, LDOW2012. 42
  • 43. Scenario: Audio Transcript Tagging ● An audio Creator and an Editor provide editorial tags; the transcript (no punctuation or capitalization, high token transcription error rates) is the textual content fed to the KBT system, which produces automated tags ● Adapted workflow over Phrase Recognition, Candidate Selection, Disambiguation and Tagging: (1) contextual relatedness, (2) dictionary-based mention detection, (3) entity-type preference-based reranking
  • 44. Tagging Audio Transcripts ● Traditional NER features are missing: sentence boundaries, POS tags, ~50% token error rate, etc. ● Lexicon-based lookup is also difficult: “big date” → big data ● Our approach: on-the-fly adaptation; skip spotting, focus on named entities ● Preliminary results: TopN = 0.19 – 0.21
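For illustration, a minimal sketch of dictionary-based mention detection over a lowercase, unpunctuated transcript. This does exact longest-match lookup only; handling transcription errors such as “big date” → big data would require fuzzy matching, which is not shown. The lexicon of surface forms is assumed to come from DBpedia, and all names are hypothetical.

```python
# Minimal sketch: greedy longest-match dictionary lookup over a noisy transcript.
def dictionary_spot(tokens, lexicon, max_len=4):
    """Return (position, phrase) pairs for known surface forms found in the token stream."""
    spots, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):            # prefer the longest match
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                spots.append((i, phrase))
                i += n
                break
        else:
            i += 1                                 # no match starting here
    return spots


transcript = "may nineteen forty five berlin is being squeezed between the british americans".split()
print(dictionary_spot(transcript, {"berlin", "british", "nineteen forty five"}))
# -> [(1, 'nineteen forty five'), (4, 'berlin'), (10, 'british')]
```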
  • 45. Case Study: Tweet NER ● NER challenges: informal text, faulty grammar, misspellings, short text, irregular capitalization, etc.; segmentation is harder than classification ● Setup: tweets from a Creator flow through the KBT system (Phrase Recognition, Candidate Selection, Disambiguation, Tagging) backed by the KB; the resulting tags are used as features for a retrained CRF recognizer that outputs entity mentions ● Our approach: distant supervision from DBpedia, with DBpedia Spotlight tagging used as features
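A small sketch of how the KBT tags could enter a CRF tagger as token features. The feature names, the spotlight_tags mapping and the surrounding training setup are hypothetical; any linear-chain CRF library that consumes per-token feature dictionaries could use such features.

```python
# Sketch: KBT (DBpedia Spotlight) tags as extra features for a linear-chain CRF tagger.
# spotlight_tags maps token indices to the DBpedia type found there (distant supervision).
def token_features(tokens, i, spotlight_tags):
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "prefix3": word[:3],
        "kbt.type": spotlight_tags.get(i, "O"),      # distant-supervision feature
    }
    if i > 0:
        feats["prev.kbt.type"] = spotlight_tags.get(i - 1, "O")
    return feats


tokens = "At home I have an IPad".split()
print(token_features(tokens, 5, {5: "DBpedia:Device"}))
```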
  • 46. Tweet NER Results ● KBT tags added as features to a linear-chain CRF tagger ● NER improves with distant supervision from KBT
  • 47. Educational Material ● Emergency Management Training “tags that summarize what happened” “configuration parameters allowed removing tags that were 'too general'” 47
  • 48. Case Study: Smart Filtering ● Goal: find microposts mentioning competitors, e.g. Some User @someuser, 4 Nov: “At home I have an IPad and my bro has a Microsoft Surface.”; Another User @anotheruser, 5 Nov: “The Asus Transformer Infinity is actually quite nifty.” ● How to look for competitors? Annotations link each tweet (e.g. https://twitter.com/someuser/status/123, https://twitter.com/anotheruser/status/456) to the products it mentions (IPad, Microsoft Surface, Asus Transformer Infinity), and the Knowledge Base places those products in categories such as category:Wi-Fi and category:Touchscreen ● Smart filtering then amounts to a query of the form SELECT ?tweet mentions ?product belongs ?category ● [Mendes et al. WI'2010 and Triplification Challenge 2010]
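A hedged sketch of the filtering query. dcterms:subject is the property DBpedia uses to connect resources to categories; the spot:mentions property linking a micropost to the products it mentions is an assumption for illustration, and the category URIs simply reuse the names shown on the slide.

```python
# Hedged sketch of the smart-filtering query; spot:mentions is a hypothetical
# property linking a micropost to the products it mentions.
SMART_FILTER = """
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX spot:    <http://example.org/annotation#>

SELECT DISTINCT ?tweet ?product WHERE {
  ?tweet   spot:mentions   ?product .
  ?product dcterms:subject ?category .
  VALUES ?category {
    <http://dbpedia.org/resource/Category:Wi-Fi>
    <http://dbpedia.org/resource/Category:Touchscreen>
  }
}
"""
```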
  • 49. Case Study: Website Tagging ● Evaluation: retrieving similar sites ● A Consumer with a website-similarity Objective drives the KBT System over the KB
  • 50. Outline ● Introduction, Motivation, Background ● Conceptual Model ● Knowledge Base ● System: DBpedia Spotlight ● Core Evaluations ● Case Studies ● Conclusion
  • 51. Conclusion ● The model enables cross-task evaluations: KE, NER, etc. can be reused for KBT but individually often do not suffice ● The model enables deeper evaluations (beyond “black box”): it prescribes modularized evaluation to identify the steps that need improvement, and introduces and validates a measure of “difficulty to disambiguate” ● The system adapts well to very distinct use cases
  • 52. Limitations ● What the proposed model is not: – A silver bullet for all problems – A substitute for machine learning or expert knowledge or linguistics research 52
  • 53. Extensions to DBpedia ● We extended DBpedia to enable KBT ● Created new extractors for the necessary data / statistics ● Multilinguality: a community process to maintain international chapters ● Results: data to power the computation of features necessary for adaptive KBT (prominence, relevance, pertinence, types, etc.), all reusable by other systems that use DBpedia
  • 54.  Demo: DBpedia Spotlight – http://spotlight.dbpedia.org/demo/  Web Service: - http://spotlight.dbpedia.org/rest/{component}  Components are exposed as services: – Phrase Recognition (/spot), – Disambiguation (/disambiguation) – Top K disambiguations (/candidates) – Relatedness (/related) – Annotation (/annotation)  Source code: https://github.com/dbpedia-spotlight/dbpedia-spotlight/  Apache V2 License 54
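A hedged usage example for the annotation service, modeled on the public DBpedia Spotlight demo API. The endpoint path (/rest/annotate rather than the /annotation component name listed above), the parameter names and the JSON response keys are assumptions that may have changed between versions, so check the project documentation.

```python
# Hedged example call to the annotation web service (endpoint, parameters and
# response keys are assumptions modeled on the public demo API).
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "text": "Berlin is the capital of Germany.",
    "confidence": 0.35,   # disambiguation confidence threshold
    "support": 20,        # prominence (support) threshold
})
req = urllib.request.Request(
    "http://spotlight.dbpedia.org/rest/annotate?" + params,
    headers={"Accept": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    annotations = json.load(resp)

for res in annotations.get("Resources", []):
    print(res.get("@surfaceForm"), "->", res.get("@URI"))
```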
  • 55. My Ph.D. in retrospect: a diagram traces the evolution toward this dissertation, mapping publications (WebSci'10, WWW'12a/b, KCAP'11, LREC'12a/b, ISEM'10/'11/'13, ISWC'12, EDBT'12, WI'08/'10, MSM'13, CIKM'12, TAC'11, SWJ'13, EvoDyn'12, SFSW@ESWC'10, SWC'10, ACMSE'10, ICSC'08, EKAW'08, BIBM'10, IESD@HT'13, NAR'06/'08, Bioinformatics'05) and systems (Sieve, Twarql, Twitris, Scooner, Cuebee, Cuadro, Garsa, TcruziDB, TcruziKB, ProtozoaDB) onto research threads: knowledge base tagging, cross-domain entity recognition and linking, Linked Data, real-time information exploration and filtering, knowledge-driven text exploration, complex entity recognition and relationship extraction, knowledge-driven query formulation, and genome databases.
  • 56. More thanks! … and other mentors and collaborators (too many great people for one slide!)
  • 57. References
  • 58. Other publications ● Bioinformatics IE & querying: 2 in the Nucleic Acids Research journal, 1 in the Bioinformatics journal, 1 IEEE ICSC, 1 EKAW, 1 Web Intelligence ● Linked Data quality and fusion: 1 at LWDM 2012 @ EDBT ● Book chapters: Semantic Search on the Web, with Bizer et al.; The People’s Web Meets NLP, with the OKF OWLG
  • 59. Impact of my research ● scholar.google.com: 480+ citations, h-index = 12 ● Best paper award at I-Semantics 2011 (174 citations according to scholar.google.com) ● 4+2 students on Google Summer of Code 2012+2013 ● About 6 open-sourced third-party clients ● Awarded first prize at the Triplification Challenge 2010 and the Scripting for the Semantic Web Challenge 2010 ● 37 publications: 9 conferences, 5 workshops/posters, 3 magazines (bioinformatics), 2 book chapters, 3 workshop proceedings
  • 60. Leadership and Community Involvement ● Co-organizer of the Web of Linked Entities workshop series (ISWC 2012 and WWW 2013) ● Founder of the DBpedia Portuguese initiative, involving volunteers from 5 Brazilian universities ● Maintainer of 3 open source projects: Cuebee (query formulation for RDF), Twarql (streaming annotated microposts), DBpedia Spotlight (adaptive semantic annotation) ● PC member in several conferences and workshops: ISWC, ESWC, LREC, LDOW, IJSWIS, LDL'2012, JWS, SWJ, etc. ● EU projects: leading FUB's participation in PlanetData (FP7 Network of Excellence); research on LOD2 (FP7 IP) and the BIG Public-Private Forum
