[Apple | Organization] and [Oranges | Fruit]:
How to Evaluate NLP Tools for Entity Extraction
Gil Irizarry
VP Engineering
ODSC East 2020
BASIS TECHNOLOGY
Agenda
● About Me
● The Problem Space
● Defining the domain
● Assemble a test set
● Annotation Guidelines
● Review of measurement
● Evaluation examples
● Interannotator agreement
● The steps to evaluation
About Me
Gil Irizarry - VP Engineering at Basis Technology, responsible for NLP and Text
Analytics software development
https://www.linkedin.com/in/gilirizarry/
Basis Technology - leading provider of software solutions for extracting
meaningful intelligence from multilingual text and digital devices
Rosette Capabilities
The Problem Space
● You have some text to analyze. Which tool do you choose?
● Related question: you have multiple text or data annotators. Which ones are
doing a good job?
● The questions are made harder by the tools outputting different formats,
analyzing data differently, and annotators interpreting data differently
● Start by defining the problem space
Defining the domain
● What space are you in?
● More importantly, in what domain will you evaluate tools?
● Are you:
○ Reading news
○ Scanning patents
○ Looking for financial fraud
Assemble a test set
● NLP systems are often trained on a general corpus. Often this corpus
consists of mainstream news articles.
● Do you use this domain or a more specific one?
● If more specific, do you train a custom model?
Annotation Guidelines
Examples requiring definition and agreement in guidelines:
● “Alice shook Brenda’s hand when she entered the meeting.” Is “Brenda” or
“Brenda’s” the entity to be extracted (in addition to “Alice”, of course)?
● Are pronouns expected to be extracted and resolved, such as “she” in the
previous example?
● What about tolerance for punctuation, such as “the U.N.” vs. “the UN”?
● Should fictitious characters (“Harry Potter”) be tagged as “person”?
● When a location appears within an organization’s name, do you tag both the
location and the organization, or just the organization (“San Francisco
Association of Realtors”)?
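One way to make guideline decisions like these enforceable is to normalize mentions before comparing tool output with annotations. The sketch below assumes a guideline that ignores possessive suffixes, periods, and a leading "the"; the function name `normalize_mention` and these particular rules are illustrative assumptions, not the behavior of any specific tool.

```python
import re

def normalize_mention(text: str) -> str:
    """Normalize an entity mention under one hypothetical guideline.

    Assumed rules (illustrative only): drop a trailing possessive,
    drop periods, collapse whitespace, and ignore a leading "the".
    """
    text = text.strip()
    # Drop a trailing possessive marker: "Brenda's" -> "Brenda"
    text = re.sub(r"['\u2019]s$", "", text)
    # Drop periods so "U.N." and "UN" compare equal
    text = text.replace(".", "")
    # Collapse runs of whitespace left behind by the steps above
    text = re.sub(r"\s+", " ", text).strip()
    # Ignore a leading article: "the UN" -> "UN"
    text = re.sub(r"^the\s+", "", text, flags=re.IGNORECASE)
    return text

print(normalize_mention("Brenda's"))
print(normalize_mention("the U.N."))
```

A different guideline (for example, one that keeps "the" as part of the entity) would simply drop or change the corresponding rule; the point is that the choice is written down once and applied to every tool and annotator.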
Annotation Guidelines
Examples requiring definition and agreement in guidelines:
● Do you tag the name of a person if it is used as a modifier (“Martin Luther King Jr.
Day”)?
● Do you tag “Twitter” in “You could try reaching out to the Twitterverse”?
● Do you tag “Google” in “I googled it, but I couldn’t find any relevant results”?
● When do you include “the” in an entity?
● How do you differentiate between an entity that’s a company name and a product
by the same name? {[ORG]The New York Times} was criticized for an article about
the {[LOC]Netherlands} in the June 4 edition of {[PRO]The New York Times}.
● “Washington and Moscow continued their negotiations.” Are Washington and
Moscow locations or organizations?
Annotation Guidelines
Non-entity extraction issues:
● How many levels of sentiment do you expect?
● Ontology and text classification - what categories do you expect?
● For language identification, are dialects identified as separate languages?
What about macrolanguages?
Annotation Guidelines
● Map to Universal Dependencies Guidelines where possible:
https://universaldependencies.org/guidelines.html
● Map to DBpedia ontology where possible:
http://mappings.dbpedia.org/server/ontology/classes/
● Map to a known database such as Wikidata where possible:
https://www.wikidata.org/wiki/Wikidata:Main_Page
Review of measurement: precision
Precision is the fraction of retrieved documents that are relevant to the query
Review of measurement: recall
Recall is the fraction of the relevant documents that are successfully retrieved
Review of measurement: F-score
F-score is the harmonic mean of precision and recall.
Precision and recall are ratios, so a harmonic mean is a more appropriate
average than an arithmetic mean.
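For entity extraction, the "retrieved" items are the entities a tool returns and the "relevant" items are the entities in the gold standard. A minimal sketch of the three measures follows, assuming each entity is represented as a (start offset, end offset, type) tuple and that only exact matches count; that representation and the name `score_entities` are assumptions made for illustration.

```python
from typing import NamedTuple, Set, Tuple

# An entity is (start_offset, end_offset, type); this representation
# is an assumption made for the sketch.
Entity = Tuple[int, int, str]

class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float

def score_entities(gold: Set[Entity], predicted: Set[Entity]) -> Scores:
    """Exact-match precision, recall, and F1 for one document."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return Scores(precision, recall, 0.0)
    f1 = 2 * precision * recall / (precision + recall)
    return Scores(precision, recall, f1)

# Hypothetical data: the tool finds two of four gold entities and
# assigns the wrong type to a third.
gold = {(0, 8, "PER"), (20, 25, "ORG"), (40, 52, "PER"), (60, 66, "LOC")}
predicted = {(0, 8, "PER"), (20, 25, "ORG"), (40, 52, "ORG")}
print(score_entities(gold, predicted))
```

Exact matching is the strictest choice: the third prediction has the right span but the wrong type, so it counts as both a false positive and a missed gold entity.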
Review of measurement: harmonic mean
A harmonic mean returns a single value that combines precision and recall.
In the accompanying image, a and b map to precision and recall, and H maps to
F score. In this example, note that increasing a alone would raise the overall
score only slightly, because the harmonic mean is dominated by the smaller of
the two values.
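The effect can be checked numerically. The sketch below uses made-up precision and recall values and Python's standard `statistics` module to compare the two averages as precision rises while recall stays low.

```python
from statistics import harmonic_mean, mean

# Hypothetical values: recall is fixed and low, precision keeps improving.
recall = 0.2
for precision in (0.5, 0.7, 0.9):
    h = harmonic_mean([precision, recall])
    a = mean([precision, recall])
    print(f"P={precision:.1f} R={recall:.1f} arithmetic={a:.2f} harmonic={h:.2f}")
```

The arithmetic mean climbs steadily with precision, while the harmonic mean of two values can never exceed twice the smaller one, so here it stays below 0.4 however high precision goes.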
Review of measurement: F-score
The previous example of F score was actually an F1 score, which balances
precision and recall evenly. A more generalized form of F score is:
Fβ = (1 + β²) x (precision x recall) / ((β² x precision) + recall)
F2 (β = 2) weights recall higher than precision, and F0.5 (β = 0.5) weights
precision higher than recall.
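A short sketch of the generalized score, using the standard F-beta definition; the function name `f_beta` and the sample precision and recall values are illustrative.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator

# Hypothetical tool with high precision and modest recall.
precision, recall = 0.9, 0.5
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {f_beta(precision, recall, beta):.3f}")
```

Because this hypothetical tool is stronger on precision than on recall, its F0.5 comes out above its F1, and its F2 below.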
Review of measurement: AP and MAP
● Average precision is a measure that combines recall and precision for
ranked retrieval results. For one information need, the average precision
is the mean of the precision scores after each relevant document is
retrieved
● Mean average precision is average precision over a range of queries
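As a sketch of both measures, assume each ranked result list is reduced to a sequence of booleans marking which positions are relevant. The function names and sample lists are illustrative. Following the definition above, precision is averaged over the relevant items that were actually retrieved; some formulations divide by the total number of relevant items instead.

```python
from typing import Sequence

def average_precision(relevance: Sequence[bool]) -> float:
    """AP for one ranked result list.

    relevance[i] is True when the item at rank i + 1 is relevant.
    Precision is sampled at each relevant item, then averaged.
    """
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(runs: Sequence[Sequence[bool]]) -> float:
    """MAP: the mean of AP over several queries."""
    if not runs:
        return 0.0
    return sum(average_precision(run) for run in runs) / len(runs)

# Hypothetical ranked results for two queries.
T, F = True, False
runs = [
    [T, T, T, F, T, T, F, F, T],
    [F, T, F, T],
]
for run in runs:
    print(f"AP  = {average_precision(run):.2f}")
print(f"MAP = {mean_average_precision(runs):.2f}")
```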
Review of measurement: MUC score
● Message Understanding Conference (MUC) scoring allows for taking
partial success into account
○ Correct: response = key
○ Partial: response ~= key
○ Incorrect: response != key
○ Spurious: key is blank and response is not
○ Missing: response is blank and key is not
○ Noncommittal: key and response are both blank
○ Recall = (correct + (partial x 0.5)) / possible
○ Precision = (correct + (partial x 0.5)) / actual
○ Undergeneration = missing / possible
○ Overgeneration = spurious / actual
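The formulas above can be sketched as a small helper. Here "possible" is taken to be everything the key expects (correct + partial + incorrect + missing) and "actual" everything the response produced (correct + partial + incorrect + spurious). The class name, function name, and sample counts are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MucCounts:
    correct: int
    partial: int
    incorrect: int
    spurious: int
    missing: int

    @property
    def possible(self) -> int:
        # Everything the key (gold standard) expects
        return self.correct + self.partial + self.incorrect + self.missing

    @property
    def actual(self) -> int:
        # Everything the response (tool output) produced
        return self.correct + self.partial + self.incorrect + self.spurious

def muc_scores(c: MucCounts) -> dict:
    """MUC-style scores; a partial match earns half credit."""
    credit = c.correct + 0.5 * c.partial
    return {
        "recall": credit / c.possible if c.possible else 0.0,
        "precision": credit / c.actual if c.actual else 0.0,
        "undergeneration": c.missing / c.possible if c.possible else 0.0,
        "overgeneration": c.spurious / c.actual if c.actual else 0.0,
    }

# Hypothetical counts, chosen only to exercise the formulas.
counts = MucCounts(correct=6, partial=2, incorrect=1, spurious=3, missing=1)
for name, value in muc_scores(counts).items():
    print(f"{name}: {value:.2f}")
```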
Evaluation Examples
As co-sponsor, Tim Cook was seated at a
table with Vogue editor Anna Wintour,
but he made time to get around and see
his other friends, including Uber CEO
Travis Kalanick. Cook's date for the night
was Laurene Powell Jobs, the widow of
Apple cofounder Steve Jobs. Powell
currently runs Emerson Collective, a
company that seeks to make
investments in education. Kalanick
brought a date as well, Gabi Holzwarth, a
well-known violinist.
Evaluation Examples - gold standard
(The same passage as above; the slide highlights the gold-standard entities.)
Evaluation Examples - P, R, F
(The same passage as above, color-coded by result.)
● (Green) TP = 6
● (Olive) FP = 1
● (Orange) TN = 3
● (Red) FN = 3
● Precision = 6/7 = .86
● Recall = 6/9 = .67
● F score = .75
Evaluation Examples - AP
(The same passage as above, with each entity marked green or red.)
● 1/1 (Green)
● 2/2 (Green)
● 3/3 (Green)
● 0/4 (Red)
● 4/5 (Green)
● 5/6 (Green)
● 0/7 (Red)
● 0/8 (Red)
● 6/9 (Green)
● AP = (1/1 + 2/2 + 3/3 + 4/5 + 5/6 + 6/9) / 6 = .88
Evaluation Examples - MUC scoring
Cook's date for the night
was Laurene Powell
Jobs, the widow of Apple
cofounder Steve Jobs.
Token    Gold    Eval    Result
Cook's   B-PER   B-PER   Partial
date     O-NONE  I-PER   Spurious
for      O-NONE  O-NONE  Correct
the      O-NONE  O-NONE  Correct
night    O-NONE  O-NONE  Correct
was      O-NONE  O-NONE  Correct
Laurene  B-PER   B-PER   Correct
Powell   I-PER   I-PER   Correct
Jobs,    I-PER   O-NONE  Missing
the      O-NONE  O-NONE  Correct
Evaluation Examples - MUC scoring
Possible = Correct + Incorrect + Partial + Missing = 11 + 1 + 1 + 1 = 14
Actual = Correct + Incorrect + Partial + Spurious = 11 + 1 + 1 + 2 = 15
Precision = (correct + (1/2 partial)) / actual = 11.5 / 15 = 0.77
Recall = (correct + (1/2 partial)) / possible = 11.5 / 14 = 0.82
Interannotator Agreement
● Krippendorff’s alpha is a reliability coefficient developed to measure the
agreement among observers, coders, judges, raters, or measuring
instruments drawing distinctions among typically unstructured
phenomena
● Cohen’s kappa is a measure of the agreement between two raters who each
assign a finite number of subjects to categories, with agreement due to
chance factored out
● Interannotator agreement scoring determines the agreement between
different annotators annotating the same unstructured text
● It is not intended to measure the output of a tool against a gold standard
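As a minimal sketch of Cohen's kappa, assume two annotators have each assigned one label to the same sequence of tokens. The function name and the sample labels are hypothetical. Kappa is computed as (observed agreement - chance agreement) / (1 - chance agreement), where chance agreement comes from each annotator's own label frequencies.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(rater_a)
    # Fraction of items on which the two raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical per-token labels from two annotators.
annotator_1 = ["PER", "PER", "ORG", "O", "O", "LOC", "O", "PER"]
annotator_2 = ["PER", "ORG", "ORG", "O", "O", "LOC", "O", "O"]
print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")
```

The annotators agree on 6 of 8 tokens (0.75), but kappa is lower because some of that agreement would be expected by chance alone.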
The Steps to Evaluation
● Define your requirements
● Assemble a valid test dataset
● Annotate the gold standard test dataset
● Get output from tools
● Evaluate the results
● Make your decision
Thank You!

[Apple|organization] and [oranges|fruit]: How to evaluate NLP tools for entity extraction

  • 1. How to Evaluate NLP Tools for Entity Extraction Gil Irizarry VP Engineering ODSC East 2020 [Apple | Organization] and [Oranges | Fruit]:
  • 2. BASIS TECHNOLOGY Agenda ● About Me ● The Problem Space ● Defining the domain ● Assemble a test set ● Annotation Guidelines ● Review of measurement ● Evaluation examples ● Interannotator agreement ● The steps to evaluation
  • 3. BASIS TECHNOLOGY About Me Gil Irizarry - VP Engineering at Basis Technology, responsible for NLP and Text Analytics software development https://www.linkedin.com/in/gilirizarry/ Basis Technology - leading provider of software solutions for extracting meaningful intelligence from multilingual text and digital devices
  • 5. BASIS TECHNOLOGY The Problem Space ● You have some text to analyze. Which tool to choose? ● Related question: You have multiple text or data annotators. Which are doing a good job? ● The questions are made harder by the tools outputting different formats, analyzing data differently, and annotators interpreting data differently ● Start by defining the problem space
  • 6. BASIS TECHNOLOGY Defining the domain ● What space are you in? ● More importantly, in what domain will you evaluate tools? ● Are you: ○ Reading news ○ Scanning patents ○ Looking for financial fraud
  • 7. BASIS TECHNOLOGY Assemble a test set ● NLP systems are often trained on a general corpus. Often this corpus consists of mainstream news articles. ● Do you use this domain or a more specific one? ● If more specific, do you train a custom model?
  • 8. BASIS TECHNOLOGY Annotation Guidelines Examples requiring definition and agreement in guidelines: ● “Alice shook Brenda’s hand when she entered the meeting.” Is “Brenda” or “Brenda’s” the entity to be extracted (in addition to Alice of course)? ● Are pronouns expected to be extracted and resolved? “She” in the previous example ● What about tolerance to punctuation? The U.N. vs. the UN ● Should fictitious characters (“Harry Potter”) be tagged as “person”? ● When a location appears within an organization’s name, do you tag the location and the organization extracted or just the organization (“San Francisco Association of Realtors”)?
  • 9. BASIS TECHNOLOGY Annotation Guidelines Examples requiring definition and agreement in guidelines: ● Do you tag the name of a person if it is used as a modifier (“Martin Luther King Jr. Day”)? ● Do you tag “Twitter” in “You could try reaching out to the Twitterverse”? ● Do you tag “Google” in “I googled it, but I couldn’t find any relevant results”? ● When do you include “the” in an entity? ● How do you differentiate between an entity that’s a company name and a product by the same name? {[ORG]The New York Times} was criticized for an article about the {[LOC]Netherlands} in the June 4 edition of {[PRO]The New York Times}. ● “Washington and Moscow continued their negotiations.” Are Washington and Moscow locations or organizations?
  • 10. BASIS TECHNOLOGY Annotation Guidelines Non-entity extraction issues: ● How many levels of sentiment do you expect? ● Ontology and text classification - what categories do you expect? ● For language identification, are dialects identified as separate languages? What about macrolanguages?
  • 12. BASIS TECHNOLOGY Annotation Guidelines ● Map to Universal Dependencies Guidelines where possible: https://universaldependencies.org/guidelines.html ● Map to DBpedia ontology where possible: http://mappings.dbpedia.org/server/ontology/classes/ ● Map to known database such as Wikidata where possible: https://www.wikidata.org/wiki/Wikidata:Main_Page
  • 13. Review of measurement: precision
    Precision is the fraction of retrieved documents that are relevant to the query.
  • 14. Review of measurement: recall
    Recall is the fraction of the relevant documents that are successfully retrieved.
  • 15. Review of measurement: F-score
    F-score is the harmonic mean of precision and recall. Precision and recall are ratios, and for averaging ratios a harmonic mean is more appropriate than an arithmetic mean.
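The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's scorer; the function names and the example counts are hypothetical, and returning 0.0 for an empty denominator is a convention chosen here, not something the slides specify.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of returned items that are correct: TP / (TP + FP)."""
    returned = tp + fp
    return tp / returned if returned else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of expected items that were returned: TP / (TP + FN)."""
    expected = tp + fn
    return tp / expected if expected else 0.0


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # Hypothetical counts: 8 correct extractions, 2 wrong ones, 4 missed.
    p, r = precision(8, 2), recall(8, 4)
    print(f"P={p:.2f} R={r:.2f} F1={f1(p, r):.2f}")
```

Working from raw counts rather than pre-rounded precision and recall avoids compounding rounding error in the F-score.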
  • 16. Review of measurement: harmonic mean
    A harmonic mean returns a single value that combines both precision and recall. In the image below, a and b map to precision and recall, and H maps to F score. In this example, note that increasing a would not increase the overall score.
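As a rough numeric companion to the picture, the sketch below compares the arithmetic and harmonic means for a few made-up precision/recall pairs. The pairs and the helper name `compare_means` are assumptions for illustration only; `statistics.harmonic_mean` and `statistics.mean` are from the Python standard library.

```python
from statistics import harmonic_mean, mean


def compare_means(p: float, r: float) -> tuple[float, float]:
    """Return (arithmetic mean, harmonic mean) of a precision/recall pair."""
    return mean([p, r]), harmonic_mean([p, r])


if __name__ == "__main__":
    # Hypothetical pairs: precision stays high while recall drops.
    for p, r in [(0.9, 0.9), (0.9, 0.5), (0.9, 0.1)]:
        arithmetic, harmonic = compare_means(p, r)
        print(f"P={p} R={r} arithmetic={arithmetic:.2f} harmonic={harmonic:.2f}")
```

The harmonic mean is pulled toward the smaller of the two values, so a tool cannot hide poor recall behind high precision (or the reverse).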
  • 17. Review of measurement: F-score
    The previous example of F score was actually an F1 score, which balances precision and recall evenly. A more generalized form of F score is:
    Fβ = (1 + β²) × (precision × recall) / ((β² × precision) + recall)
    F2 (β = 2) weights recall higher than precision, and F0.5 (β = 0.5) weights precision higher than recall.
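A short sketch of the generalized score, using the standard F-beta definition. The function name `f_beta` and the sample precision and recall values are hypothetical.

```python
def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision p and recall r.

    beta > 1 favors recall, beta < 1 favors precision, beta = 1 gives F1.
    """
    b2 = beta * beta
    denominator = b2 * p + r
    return (1 + b2) * p * r / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical tool with high precision and modest recall.
    p, r = 0.9, 0.5
    for beta in (0.5, 1.0, 2.0):
        print(f"F{beta:g} = {f_beta(p, r, beta):.3f}")
```

For a tool whose precision exceeds its recall, F0.5 comes out higher than F1 and F2 lower, which is one way to check that the weighting points in the intended direction.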
  • 18. Review of measurement: AP and MAP
    ● Average precision is a measure that combines recall and precision for ranked retrieval results. For one information need, the average precision is the mean of the precision scores after each relevant document is retrieved.
    ● Mean average precision is average precision over a range of queries.
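The two definitions can be sketched as follows. The input is a ranked list of booleans (True where the result at that rank is relevant); the function names and the sample ranking are hypothetical. Note one assumption: this version averages over the relevant results that were actually retrieved, as in the definition above, whereas some formulations divide by the total number of relevant items instead.

```python
from typing import Sequence


def average_precision(relevance: Sequence[bool]) -> float:
    """Mean of precision-at-rank taken at each rank holding a relevant result."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs: Sequence[Sequence[bool]]) -> float:
    """Mean of average precision over several queries (or runs)."""
    return sum(average_precision(run) for run in runs) / len(runs) if runs else 0.0


if __name__ == "__main__":
    # Hypothetical ranking: misses at ranks 4, 7 and 8.
    ranking = [True, True, True, False, True, True, False, False, True]
    print(f"AP = {average_precision(ranking):.2f}")
```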
  • 19. Review of measurement: MUC score
    ● Message Understanding Conference (MUC) scoring allows for taking partial success into account
      ○ Correct: response = key
      ○ Partial: response ~= key
      ○ Incorrect: response != key
      ○ Spurious: key is blank and response is not
      ○ Missing: response is blank and key is not
      ○ Noncommittal: key and response are both blank
      ○ Recall = (correct + (partial × 0.5)) / possible
      ○ Precision = (correct + (partial × 0.5)) / actual
      ○ Undergeneration = missing / possible
      ○ Overgeneration = spurious / actual
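The MUC formulas above can be sketched as a small Python helper. "Possible" and "actual" are computed as Correct + Incorrect + Partial + Missing and Correct + Incorrect + Partial + Spurious respectively, following the definitions in the Editor's Notes. The class name, function name, and example counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MucCounts:
    correct: int = 0
    partial: int = 0
    incorrect: int = 0
    spurious: int = 0
    missing: int = 0


def muc_scores(c: MucCounts) -> dict[str, float]:
    """Compute MUC recall, precision, undergeneration and overgeneration."""
    possible = c.correct + c.incorrect + c.partial + c.missing  # entities in the key
    actual = c.correct + c.incorrect + c.partial + c.spurious   # entities in the response
    credit = c.correct + 0.5 * c.partial                        # partial matches earn half credit
    return {
        "recall": credit / possible if possible else 0.0,
        "precision": credit / actual if actual else 0.0,
        "undergeneration": c.missing / possible if possible else 0.0,
        "overgeneration": c.spurious / actual if actual else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical counts, not taken from any slide.
    counts = MucCounts(correct=8, partial=2, incorrect=1, spurious=1, missing=1)
    for name, value in muc_scores(counts).items():
        print(f"{name}: {value:.2f}")
```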
  • 20. Evaluation Examples
    As co-sponsor, Tim Cook was seated at a table with Vogue editor Anna Wintour, but he made time to get around and see his other friends, including Uber CEO Travis Kalanick. Cook's date for the night was Laurene Powell Jobs, the widow of Apple cofounder Steve Jobs. Powell currently runs Emerson Collective, a company that seeks to make investments in education. Kalanick brought a date as well, Gabi Holzwarth, a well-known violinist.
  • 21. Evaluation Examples - gold standard
    As co-sponsor, Tim Cook was seated at a table with Vogue editor Anna Wintour, but he made time to get around and see his other friends, including Uber CEO Travis Kalanick. Cook's date for the night was Laurene Powell Jobs, the widow of Apple cofounder Steve Jobs. Powell currently runs Emerson Collective, a company that seeks to make investments in education. Kalanick brought a date as well, Gabi Holzwarth, a well-known violinist.
  • 22. Evaluation Examples - P, R, F
    As co-sponsor, Tim Cook was seated at a table with Vogue editor Anna Wintour, but he made time to get around and see his other friends, including Uber CEO Travis Kalanick. Cook's date for the night was Laurene Powell Jobs, the widow of Apple cofounder Steve Jobs. Powell currently runs Emerson Collective, a company that seeks to make investments in education. Kalanick brought a date as well, Gabi Holzwarth, a well-known violinist.
    ● (Green) TP = 6
    ● (Olive) FP = 1
    ● (Orange) TN = 3
    ● (Red) FN = 3
    ● Precision = 6/7 = .86
    ● Recall = 6/9 = .67
    ● F score = .75
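One way to arrive at counts like these is to compare the tool's output with the gold standard as sets of exact-match mentions. The sketch below assumes the nine person mentions in the passage as the gold set (including "Cook's", "Powell" and "Kalanick" as later references). The predicted set is invented for illustration, chosen only so that the counts come out as TP = 6, FP = 1, FN = 3; the slide's color highlighting is not recoverable from the transcript. The function name `score_exact` is hypothetical.

```python
def score_exact(gold: set, predicted: set) -> dict:
    """Exact-match scoring of predicted mentions against gold mentions."""
    tp = len(gold & predicted)   # in both sets
    fp = len(predicted - gold)   # returned but not in gold
    fn = len(gold - predicted)   # in gold but not returned
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": p, "recall": r, "f1": f}


# Assumed gold person mentions, as (mention, type) pairs.
GOLD = {
    ("Tim Cook", "PER"), ("Anna Wintour", "PER"), ("Travis Kalanick", "PER"),
    ("Cook's", "PER"), ("Laurene Powell Jobs", "PER"), ("Steve Jobs", "PER"),
    ("Powell", "PER"), ("Kalanick", "PER"), ("Gabi Holzwarth", "PER"),
}

# Invented tool output: misses the three short references and
# wrongly tags an organization as a person.
PREDICTED = {
    ("Tim Cook", "PER"), ("Anna Wintour", "PER"), ("Travis Kalanick", "PER"),
    ("Laurene Powell Jobs", "PER"), ("Steve Jobs", "PER"),
    ("Gabi Holzwarth", "PER"), ("Emerson Collective", "PER"),
}

if __name__ == "__main__":
    print(score_exact(GOLD, PREDICTED))
```

A real comparison would usually key mentions by character offsets rather than strings, so that repeated mentions of the same name are counted separately; strings are used here only to keep the sketch short.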
  • 23. Evaluation Examples - AP
    As co-sponsor, Tim Cook was seated at a table with Vogue editor Anna Wintour, but he made time to get around and see his other friends, including Uber CEO Travis Kalanick. Cook's date for the night was Laurene Powell Jobs, the widow of Apple cofounder Steve Jobs. Powell currently runs Emerson Collective, a company that seeks to make investments in education. Kalanick brought a date as well, Gabi Holzwarth, a well-known violinist.
    ● 1/1 (Green)
    ● 2/2 (Green)
    ● 3/3 (Green)
    ● 0/4 (Red)
    ● 4/5 (Green)
    ● 5/6 (Green)
    ● 0/7 (Red)
    ● 0/8 (Red)
    ● 6/9 (Green)
    ● AP = (1/1 + 2/2 + 3/3 + 4/5 + 5/6 + 6/9) / 6 = .88
  • 24. Evaluation Examples - MUC scoring
    Cook's date for the night was Laurene Powell Jobs, the widow of Apple cofounder Steve Jobs.

    Token      Gold      Eval      Result
    Cook's     B-PER     B-PER     Partial
    date       O-NONE    I-PER     Spurious
    for        O-NONE    O-NONE    Correct
    the        O-NONE    O-NONE    Correct
    night      O-NONE    O-NONE    Correct
    was        O-NONE    O-NONE    Correct
    Laurene    B-PER     B-PER     Correct
    Powell     I-PER     I-PER     Correct
    Jobs,      I-PER     O-NONE    Missing
    the        O-NONE    O-NONE    Correct
  • 25. Evaluation Examples - MUC scoring
    Possible = Correct + Incorrect + Partial + Missing = 11 + 1 + 1 + 1 = 14
    Actual = Correct + Incorrect + Partial + Spurious = 11 + 1 + 1 + 2 = 15
    Precision = (correct + (1/2 partial)) / actual = 11.5 / 15 = 0.77
    Recall = (correct + (1/2 partial)) / possible = 11.5 / 14 = 0.82
  • 26. Interannotator Agreement
    ● Krippendorff’s alpha is a reliability coefficient developed to measure the agreement among observers, coders, judges, raters, or measuring instruments drawing distinctions among typically unstructured phenomena
    ● Cohen’s kappa is a measure of the agreement between two raters who determine which category a finite number of subjects belong to, whereby agreement due to chance is factored out
    ● Interannotator agreement scoring determines the agreement between different annotators annotating the same unstructured text
    ● It is not intended to measure the output of a tool against a gold standard
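For two annotators labeling the same items, Cohen's kappa can be sketched as (observed agreement - chance agreement) / (1 - chance agreement). The function name and the per-token labels below are hypothetical. Krippendorff's alpha, which handles more than two annotators and missing labels, is not covered by this sketch.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical per-token labels from two annotators.
    annotator_1 = ["PER", "PER", "ORG", "LOC", "O", "O", "O", "PER"]
    annotator_2 = ["PER", "ORG", "ORG", "LOC", "O", "O", "PER", "PER"]
    print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")
```

In this example the annotators agree on 6 of 8 tokens (0.75 raw agreement), but kappa is lower because some of that agreement would be expected by chance.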
  • 27. The Steps to Evaluation
    ● Define your requirements
    ● Assemble a valid test dataset
    ● Annotate the gold standard test dataset
    ● Get output from tools
    ● Evaluate the results
    ● Make your decision

Editor's Notes

  1. Rosette is a full NLP stack from language identification to morphology to entity extraction and resolution
  2. One tool will output 5 levels of sentiment and another only 3. One tool will output transitive vs. intransitive verbs and another will output only verbs. One will strip possessives (King’s Landing) and another won’t.
  3. Finding data is easier but annotating data is hard
  4. The Ukraine is now Ukraine, similarly Sudan. How do you handle the change over time?
  5. Screenshot of the TOC of our Annotation Guidelines. 42 pages. In some meetings, it’s the only doc under NDA. Header says for all. That means for all languages. We also have specific guidelines for some languages.
  6. Images from Wikipedia
  7. Images from Wikipedia
  8. A harmonic mean is a better balance of two values than a simple average
  9. Increasing A would lower the overall score, since both G and H would get smaller
  10. Changing the beta value allows you to tune the harmonic mean and weight either precision or recall more heavily
  11. https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-39940-9_482 Precision is a single value. Average precision takes into account precision over a range of results. Mean average precision is the mean over a range of queries.
  12. http://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/ https://pdfs.semanticscholar.org/f898/e821bbf4157d857dc512a85f49610638f1aa.pdf
  13. Annotated sample of people names. Note “Cook’s” and “Powell” as references to earlier names. Note the “Emerson Collective” as an organization name is not highlighted.
  14. Precision = TP / (TP + FP), Recall = TP / (TP + FN) , F = 2*((P * R)/(P + R))
  15. AP = (sum of (True Positive / Predicted Positive)) / number of True Positives. MAP is the mean of AP over a range of different queries, for example varying the tolerances or confidences.
  16. https://pdfs.semanticscholar.org/f898/e821bbf4157d857dc512a85f49610638f1aa.pdf http://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/
  17. Possible: the number of entities hand-annotated in the gold evaluation corpus, equal to (Correct + Incorrect + Partial + Missing). Actual: the number of entities tagged by the test NER system, equal to (Correct + Incorrect + Partial + Spurious). Recall (R) = (correct + (1/2 partial)) / possible; Precision (P) = (correct + (1/2 partial)) / actual; F = (2 * P * R) / (P + R)
  18. http://www.real-statistics.com/reliability/interrater-reliability/cohens-kappa/