[Apple | Organization] and [Oranges | Fruit]:
How to Evaluate NLP Tools for Entity Extraction
Gil Irizarry
VP Engineering
BASIS TECHNOLOGY
About Rosette
Trusted In Critical Applications By
About Me
Gil Irizarry - VP Engineering at Basis Technology, responsible for NLP and Text
Analytics software development
https://www.linkedin.com/in/gilirizarry/
https://www.slideshare.net/conoagil
gil@basistech.com
Basis Technology - leading provider of software solutions for extracting
meaningful intelligence from multilingual text and digital devices
Agenda
● The problem space
● Defining the domain
● Assemble a test set
● Annotation guidelines
● Review of measurement
● Evaluation examples
● Inter-annotator agreement
● The steps to evaluation
Rosette Capabilities
The Problem Space
● You have some text to analyze. Which tool to choose?
● Related question: You have multiple text or data annotators. Which are doing
a good job?
● The questions are made harder by the tools outputting different formats,
analyzing data differently, and annotators interpreting data differently
● Start by defining the problem space
The Problem Space - Example
Rosette Amazon Comprehend
Defining the domain
● What space are you in?
● More importantly, in what domain will you evaluate tools?
● Are you:
○ Reading news
○ Scanning patents
○ Looking for financial fraud
Assemble a test set
● NLP systems are often trained on a general corpus. Often this corpus
consists of mainstream news articles.
● Do you use this domain or a more specific one?
● If more specific, do you train a custom model?
Annotation Guidelines
Examples requiring definition and agreement in guidelines:
● “Alice shook Brenda’s hand when she entered the meeting.” Is “Brenda” or
“Brenda’s” the entity to be extracted (in addition to “Alice”, of course)?
● Are pronouns expected to be extracted and resolved (for example, “she” in the
previous sentence)?
● What about tolerance to punctuation? The U.N. vs. the UN
● Should fictitious characters (“Harry Potter”) be tagged as “person”?
● When a location appears within an organization’s name, do you tag both the
location and the organization, or just the organization (“San Francisco
Association of Realtors”)?
Annotation Guidelines
Examples requiring definition and agreement in guidelines:
● Do you tag the name of a person if it is used as a modifier (“Martin Luther King Jr.
Day”)?
● Do you tag “Twitter” in “You could try reaching out to the Twitterverse”?
● Do you tag “Google” in “I googled it, but I couldn’t find any relevant results”?
● When do you include “the” in an entity? The Ukraine vs. Ukraine
● How do you differentiate between an entity that’s a company name and a product by
the same name? {[ORG]The New York Times} was criticized for an article about the
{[LOC]Netherlands} in the June 4 edition of {[PRO]The New York Times}.
● “Washington and Moscow continued their negotiations.” Are Washington and
Moscow locations or organizations?
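Once the guidelines settle questions like these, the decisions can be encoded in a normalization step applied to both the gold standard and the tool output before comparison. The sketch below is only an illustration: the function name `normalize_mention` is hypothetical, and it assumes the guidelines chose to ignore possessive markers, abbreviation periods, and a leading "the". Different guidelines would call for different rules (for example, keeping "The" in "The New York Times").

```python
import re


def normalize_mention(text: str) -> str:
    """Normalize an entity mention so guideline-tolerated variants compare equal.

    Assumed (hypothetical) guideline decisions:
      - a trailing possessive marker is not part of the entity
      - periods inside abbreviations are ignored
      - a leading "the" is ignored
    """
    t = text.strip()
    # "Brenda's" / "Brenda’s" -> "Brenda"
    t = re.sub(r"['’]s$", "", t)
    # "U.N." -> "UN"
    t = t.replace(".", "")
    # "The Ukraine" -> "Ukraine"
    t = re.sub(r"^the\s+", "", t, flags=re.IGNORECASE)
    # Collapse runs of whitespace and ignore case
    return " ".join(t.split()).casefold()


# Variants that these assumed guidelines treat as the same mention
print(normalize_mention("Brenda’s") == normalize_mention("Brenda"))
print(normalize_mention("the U.N.") == normalize_mention("UN"))
print(normalize_mention("The Ukraine") == normalize_mention("Ukraine"))
```

Keeping these rules in one function makes the guideline choices explicit and easy to change when annotators disagree.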
Annotation Guidelines
Non-entity extraction issues:
● How many levels of sentiment do you expect?
● Ontology and text classification - what categories do you expect?
● For language identification, are dialects identified as separate languages?
What about macrolanguages?
Annotation Guidelines
● Map to Universal Dependencies Guidelines where possible:
https://universaldependencies.org/guidelines.html
● Map to DBpedia ontology where possible:
http://mappings.dbpedia.org/server/ontology/classes/
● Map to known database such as Wikidata where possible:
https://www.wikidata.org/wiki/Wikidata:Main_Page
Review of measurement: precision
Precision is the fraction of retrieved documents that are relevant to the query
(for entity extraction: the fraction of extracted entities that are correct)
Review of measurement: recall
Recall is the fraction of the relevant documents that are successfully retrieved
(for entity extraction: the fraction of gold-standard entities that are extracted)
Review of measurement: F-score
F-score is the harmonic mean of precision and recall
Precision and recall are ratios, and for averaging ratios a harmonic mean is more
appropriate than an arithmetic mean.
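As a minimal sketch of these three measures for entity extraction, the code below compares a set of gold-standard entities against a set of predicted entities. It assumes each entity is represented as a hypothetical `(start, end, type)` tuple and that only exact matches count; the offsets shown are made up for illustration.

```python
from typing import Set, Tuple

Entity = Tuple[int, int, str]  # (start offset, end offset, entity type): an assumed representation


def precision_recall_f1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 between gold and predicted entities."""
    tp = len(gold & predicted)   # entities the tool found that are in the gold standard
    fp = len(predicted - gold)   # entities the tool found that are not in the gold standard
    fn = len(gold - predicted)   # gold-standard entities the tool missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative, made-up annotations
gold = {(14, 22, "PER"), (51, 63, "PER"), (112, 116, "ORG")}
predicted = {(14, 22, "PER"), (112, 116, "ORG"), (130, 135, "LOC")}
p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Exact matching is the strictest choice; a scheme that gives credit for partial matches needs a different comparison than set intersection.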
Review of measurement: harmonic mean
A harmonic mean returns a single value that combines both precision and recall. In
the accompanying image, a and b map to precision and recall, and H maps to F score. In
this example, note that increasing a alone would raise the overall score only slightly:
H can never exceed twice the smaller of the two values.
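A small numeric sketch can make this behaviour concrete. It holds recall fixed at an arbitrary illustrative value of 0.2 and raises precision, printing the arithmetic and harmonic means side by side; the function names are chosen here for illustration only.

```python
def arithmetic_mean(a: float, b: float) -> float:
    return (a + b) / 2


def harmonic_mean(a: float, b: float) -> float:
    # H = 2ab / (a + b); defined as 0 when both inputs are 0
    return 2 * a * b / (a + b) if a + b else 0.0


recall = 0.2  # illustrative value, held fixed
for precision in (0.5, 0.9, 1.0):
    am = arithmetic_mean(precision, recall)
    hm = harmonic_mean(precision, recall)
    print(f"precision={precision:.1f}  arithmetic={am:.3f}  harmonic={hm:.3f}")
```

The arithmetic mean keeps climbing as precision improves, while the harmonic mean stays below twice the recall (0.4 here), so a tool cannot earn a high F score by excelling at only one of the two measures.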
Review of measurement: F-score
The previous example of F score was actually an F1 score, which balances precision
and recall evenly. A more generalized form of F score is:
Fβ = (1 + β²) x (precision x recall) / ((β² x precision) + recall)
F2 (β = 2) weights recall higher than precision and F0.5 (β = 0.5) weights precision
higher than recall
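The effect of β can be checked with a short sketch. It assumes the standard F-beta definition (stated in the code comment) and uses made-up precision and recall values; `f_beta` is a name chosen for this example.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Illustrative tool with high precision and lower recall
precision, recall = 0.9, 0.5
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {f_beta(precision, recall, beta):.3f}")
```

For a tool whose precision exceeds its recall, F0.5 comes out highest and F2 lowest, which matches the weighting described above.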
Review of measurement: AP and MAP
● Average precision is a measure that combines recall and precision for ranked
retrieval results. For one information need, the average precision is the mean
of the precision scores after each relevant document is retrieved
● Mean average precision is the mean of the average precision scores over a set of
queries
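The two definitions can be sketched as follows. Each ranked result list is represented as a list of booleans (relevant or not, best-ranked first), which is an assumed input format. The sketch averages over the relevant items that actually appear in the ranking, following the description above; some definitions divide by the total number of relevant items instead, which gives a lower score when relevant items are missing from the ranking.

```python
from typing import List, Sequence


def average_precision(relevance: Sequence[bool]) -> float:
    """Mean of the precision values measured at each relevant item in a ranked list."""
    hits = 0
    precisions: List[float] = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)  # precision after this relevant item
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings: Sequence[Sequence[bool]]) -> float:
    """Mean of average precision over several queries (one ranked list per query)."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two made-up queries
query_1 = [True, False, True]    # precision 1/1 at rank 1 and 2/3 at rank 3
query_2 = [False, True]          # precision 1/2 at rank 2
print(round(average_precision(query_1), 3))
print(round(mean_average_precision([query_1, query_2]), 3))
```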
Review of measurement: MUC score
● Message Understanding Conference (MUC) scoring allows for taking partial
success into account
○ Correct: response = key
○ Partial: response ~= key
○ Incorrect: response != key
○ Spurious: key is blank and response is not
○ Missing: response is blank and key is not
○ Noncommittal: key and response are both blank
○ Recall = (correct + (partial x 0.5)) / possible
○ Precision = (correct + (partial x 0.5)) / actual
○ Undergeneration = missing / possible
○ Overgeneration = spurious / actual
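A minimal sketch of these formulas, taking the per-category counts as input. It uses possible = correct + incorrect + partial + missing and actual = correct + incorrect + partial + spurious, as in the worked example later in this deck. The function name and the counts in the usage lines are hypothetical.

```python
from typing import Dict


def muc_scores(correct: int, partial: int, incorrect: int,
               spurious: int, missing: int) -> Dict[str, float]:
    """MUC-style scores from result counts, with half credit for partial matches."""
    possible = correct + incorrect + partial + missing   # what the key (gold) expects
    actual = correct + incorrect + partial + spurious    # what the system produced
    credit = correct + 0.5 * partial

    def ratio(num: float, den: int) -> float:
        return num / den if den else 0.0

    return {
        "recall": ratio(credit, possible),
        "precision": ratio(credit, actual),
        "undergeneration": ratio(missing, possible),
        "overgeneration": ratio(spurious, actual),
    }


# Hypothetical counts, for illustration only
scores = muc_scores(correct=8, partial=2, incorrect=1, spurious=1, missing=1)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```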
Evaluation Examples
As co-sponsor, Tim Cook was seated at a
table with Vogue editor Anna Wintour, but
he made time to get around and see his
other friends, including Uber CEO Travis
Kalanick. Cook's date for the night was
Laurene Powell Jobs, the widow of Apple
cofounder Steve Jobs. Powell currently
runs Emerson Collective, a company that
seeks to make investments in education.
Kalanick brought a date as well, Gabi
Holzwarth, a well-known violinist.
Evaluation Examples - gold standard
(The same passage as on the previous slide, with the gold-standard entity annotations highlighted in color.)
Evaluation Examples - P, R, F
● (Green) TP = 6
● (Olive) FP = 1
● (Orange) TN = 3
● (Red) FN = 3
● Precision = 6/7 = .86
● Recall = 6/9 = .67
● F score = .75
Evaluation Examples - AP
● 1/1 (Green)
● 2/2 (Green)
● 3/3 (Green)
● 0/4 (Red)
● 4/5 (Green)
● 5/6 (Green)
● 0/7 (Red)
● 0/8 (Red)
● 6/9 (Green)
● AP = (1/1 + 2/2 + 3/3 + 4/5 + 5/6 +
6/9) / 6 = .88
Evaluation Examples - MUC scoring
Token      Gold    Eval    Result
Cook's     B-PER   B-PER   Partial
date       O-NONE  I-PER   Spurious
for        O-NONE  O-NONE  Correct
the        O-NONE  O-NONE  Correct
night      O-NONE  O-NONE  Correct
was        O-NONE  O-NONE  Correct
Laurene    B-PER   B-PER   Correct
Powell     I-PER   I-PER   Correct
Jobs,      I-PER   O-NONE  Missing
the        O-NONE  O-NONE  Correct
widow      O-NONE  O-NONE  Correct
of         O-NONE  O-NONE  Correct
Apple      O-NONE  B-ORG   Spurious
cofounder  O-NONE  O-NONE  Correct
Steve      B-PER   B-PER   Correct
Jobs.      I-PER   I-LOC   Incorrect
Evaluation Examples - MUC scoring
Possible = Correct + Incorrect + Partial +
Missing = 11 + 1 + 1 + 1 = 14
Actual = Correct + Incorrect + Partial +
Spurious = 11 + 1 + 1 + 2 = 15
Precision = (correct + (1/2 partial)) / actual =
11.5 / 15 = 0.77
Recall = (correct + (1/2 partial)) / possible =
11.5 / 14 = 0.82
Inter-annotator Agreement
● Krippendorff’s alpha is a reliability coefficient developed to measure the
agreement among observers, coders, judges, raters, or measuring
instruments drawing distinctions among typically unstructured phenomena
● Cohen’s kappa is a measure of the agreement between two raters who each
assign a finite number of subjects to categories, with agreement due to
chance factored out
● Inter-annotator agreement scoring determines the agreement between
different annotators annotating the same unstructured text
● It is not intended to measure the output of a tool against a gold standard
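As a sketch of how chance is factored out, the code below computes Cohen’s kappa for two annotators who labeled the same items: kappa = (observed agreement - expected agreement) / (1 - expected agreement), where expected agreement comes from each annotator’s own label frequencies. The label sequences are made up, and the function name is chosen for this example. Krippendorff’s alpha, which handles more than two annotators and missing labels, is not covered here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up token labels from two annotators
annotator_1 = ["PER", "PER", "ORG", "O", "O", "LOC"]
annotator_2 = ["PER", "ORG", "ORG", "O", "O", "O"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Raw agreement in this example is 4 out of 6, but kappa is lower because part of that agreement would be expected by chance alone.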
The Steps to Evaluation
● Define your requirements
● Assemble a valid test dataset
● Annotate the gold standard test dataset
● Get output from tools
● Evaluate the results
● Make your decision
Thank You!
Gil Irizarry
VP Engineering
gil@basistech.com
https://www.linkedin.com/in/gilirizarry/
https://www.slideshare.net/conoagil

More Related Content

Similar to [Apple-organization] and [oranges-fruit] - How to evaluate NLP tools - Basis Webinar

EA Benefits Realization in a Digital World
EA Benefits Realization in a Digital WorldEA Benefits Realization in a Digital World
EA Benefits Realization in a Digital World
Kaine Ugwu
 
How to Meet Goals and Inspire Your Team Using OKRs (Includes OKR Examples)
How to Meet Goals and Inspire Your Team Using OKRs (Includes OKR Examples) How to Meet Goals and Inspire Your Team Using OKRs (Includes OKR Examples)
How to Meet Goals and Inspire Your Team Using OKRs (Includes OKR Examples)
QuekelsBaro
 
Week 5 organization
Week 5 organizationWeek 5 organization
Week 5 organization
Amy Hayashi
 
Confirmations and ContradictionsThe Leontief Paradox, Reco.docx
Confirmations and ContradictionsThe Leontief Paradox, Reco.docxConfirmations and ContradictionsThe Leontief Paradox, Reco.docx
Confirmations and ContradictionsThe Leontief Paradox, Reco.docx
maxinesmith73660
 
Job Analysis For A Job
Job Analysis For A JobJob Analysis For A Job
Job Analysis For A Job
Kate Subramanian
 
The Ultimate Guide to Non-Coding Tech Jobs
The Ultimate Guide to Non-Coding Tech JobsThe Ultimate Guide to Non-Coding Tech Jobs
The Ultimate Guide to Non-Coding Tech Jobs
Jeremy Schifeling
 
BoSUSA18 | Bob Moesta| The 5 Skills Of An Innovator
BoSUSA18 | Bob Moesta| The 5 Skills Of An InnovatorBoSUSA18 | Bob Moesta| The 5 Skills Of An Innovator
BoSUSA18 | Bob Moesta| The 5 Skills Of An Innovator
Business of Software Conference
 
How to Pass an Interview for Software Engineer
How to Pass an Interview for Software EngineerHow to Pass an Interview for Software Engineer
How to Pass an Interview for Software Engineer
suttoantruot
 
Powerpoint CVs
Powerpoint CVsPowerpoint CVs
Powerpoint CVs
raknin
 
1 Undergraduate Program Rubric—BACHELOR OF SCIENCE IN CRI.docx
 1 Undergraduate Program Rubric—BACHELOR OF SCIENCE IN CRI.docx 1 Undergraduate Program Rubric—BACHELOR OF SCIENCE IN CRI.docx
1 Undergraduate Program Rubric—BACHELOR OF SCIENCE IN CRI.docx
joyjonna282
 
CHAPTER 3Marketing Decision Making and Case Analysis
CHAPTER 3Marketing Decision Making and Case AnalysisCHAPTER 3Marketing Decision Making and Case Analysis
CHAPTER 3Marketing Decision Making and Case Analysis
EstelaJeffery653
 
Introduction
IntroductionIntroduction
Introduction
Sharon Keenan M.Ed.
 
Make User Experience Part of The KPI Conversation With Universal Measures
Make User Experience Part of The KPI Conversation With Universal MeasuresMake User Experience Part of The KPI Conversation With Universal Measures
Make User Experience Part of The KPI Conversation With Universal Measures
UserZoom
 
How to present creative 2017
How to present creative   2017How to present creative   2017
How to present creative 2017
Joel Eby
 
TTIPEC: Monitoring and Evaluation (Session 2)
TTIPEC: Monitoring and Evaluation (Session 2)TTIPEC: Monitoring and Evaluation (Session 2)
TTIPEC: Monitoring and Evaluation (Session 2)
Research to Action
 
5 Steps to Writing a Resume that Sells
5 Steps to Writing a Resume that Sells5 Steps to Writing a Resume that Sells
5 Steps to Writing a Resume that Sells
ApnaCourse
 
Lecture 9 job analysis
Lecture 9 job analysisLecture 9 job analysis
Lecture 9 job analysis
Chandan Sah
 
STAR (an impromptu speaking technique)
STAR (an impromptu speaking technique)STAR (an impromptu speaking technique)
STAR (an impromptu speaking technique)
Olga Sergeeva
 
ORA Workshop Presentation
ORA Workshop PresentationORA Workshop Presentation
ORA Workshop Presentation
Marshall Karp
 
Building a Peer Evaluation Program
Building a Peer Evaluation ProgramBuilding a Peer Evaluation Program
Building a Peer Evaluation Program
Qualtrics
 

Similar to [Apple-organization] and [oranges-fruit] - How to evaluate NLP tools - Basis Webinar (20)

EA Benefits Realization in a Digital World
EA Benefits Realization in a Digital WorldEA Benefits Realization in a Digital World
EA Benefits Realization in a Digital World
 
How to Meet Goals and Inspire Your Team Using OKRs (Includes OKR Examples)
How to Meet Goals and Inspire Your Team Using OKRs (Includes OKR Examples) How to Meet Goals and Inspire Your Team Using OKRs (Includes OKR Examples)
How to Meet Goals and Inspire Your Team Using OKRs (Includes OKR Examples)
 
Week 5 organization
Week 5 organizationWeek 5 organization
Week 5 organization
 
Confirmations and ContradictionsThe Leontief Paradox, Reco.docx
Confirmations and ContradictionsThe Leontief Paradox, Reco.docxConfirmations and ContradictionsThe Leontief Paradox, Reco.docx
Confirmations and ContradictionsThe Leontief Paradox, Reco.docx
 
Job Analysis For A Job
Job Analysis For A JobJob Analysis For A Job
Job Analysis For A Job
 
The Ultimate Guide to Non-Coding Tech Jobs
The Ultimate Guide to Non-Coding Tech JobsThe Ultimate Guide to Non-Coding Tech Jobs
The Ultimate Guide to Non-Coding Tech Jobs
 
BoSUSA18 | Bob Moesta| The 5 Skills Of An Innovator
BoSUSA18 | Bob Moesta| The 5 Skills Of An InnovatorBoSUSA18 | Bob Moesta| The 5 Skills Of An Innovator
BoSUSA18 | Bob Moesta| The 5 Skills Of An Innovator
 
How to Pass an Interview for Software Engineer
How to Pass an Interview for Software EngineerHow to Pass an Interview for Software Engineer
How to Pass an Interview for Software Engineer
 
Powerpoint CVs
Powerpoint CVsPowerpoint CVs
Powerpoint CVs
 
1 Undergraduate Program Rubric—BACHELOR OF SCIENCE IN CRI.docx
 1 Undergraduate Program Rubric—BACHELOR OF SCIENCE IN CRI.docx 1 Undergraduate Program Rubric—BACHELOR OF SCIENCE IN CRI.docx
1 Undergraduate Program Rubric—BACHELOR OF SCIENCE IN CRI.docx
 
CHAPTER 3Marketing Decision Making and Case Analysis
CHAPTER 3Marketing Decision Making and Case AnalysisCHAPTER 3Marketing Decision Making and Case Analysis
CHAPTER 3Marketing Decision Making and Case Analysis
 
Introduction
IntroductionIntroduction
Introduction
 
Make User Experience Part of The KPI Conversation With Universal Measures
Make User Experience Part of The KPI Conversation With Universal MeasuresMake User Experience Part of The KPI Conversation With Universal Measures
Make User Experience Part of The KPI Conversation With Universal Measures
 
How to present creative 2017
How to present creative   2017How to present creative   2017
How to present creative 2017
 
TTIPEC: Monitoring and Evaluation (Session 2)
TTIPEC: Monitoring and Evaluation (Session 2)TTIPEC: Monitoring and Evaluation (Session 2)
TTIPEC: Monitoring and Evaluation (Session 2)
 
5 Steps to Writing a Resume that Sells
5 Steps to Writing a Resume that Sells5 Steps to Writing a Resume that Sells
5 Steps to Writing a Resume that Sells
 
Lecture 9 job analysis
Lecture 9 job analysisLecture 9 job analysis
Lecture 9 job analysis
 
STAR (an impromptu speaking technique)
STAR (an impromptu speaking technique)STAR (an impromptu speaking technique)
STAR (an impromptu speaking technique)
 
ORA Workshop Presentation
ORA Workshop PresentationORA Workshop Presentation
ORA Workshop Presentation
 
Building a Peer Evaluation Program
Building a Peer Evaluation ProgramBuilding a Peer Evaluation Program
Building a Peer Evaluation Program
 

More from Gil Irizarry

A Rose By Any Other Name.pdf
A Rose By Any Other Name.pdfA Rose By Any Other Name.pdf
A Rose By Any Other Name.pdf
Gil Irizarry
 
Ai for Good: Bad Guys, Messy Data, & NLP
Ai for Good: Bad Guys, Messy Data, & NLPAi for Good: Bad Guys, Messy Data, & NLP
Ai for Good: Bad Guys, Messy Data, & NLP
Gil Irizarry
 
DevSecOps Orchestration of Text Analytics with Containers
DevSecOps Orchestration of Text Analytics with ContainersDevSecOps Orchestration of Text Analytics with Containers
DevSecOps Orchestration of Text Analytics with Containers
Gil Irizarry
 
Towards Identity Resolution: The Challenge of Name Matching
Towards Identity Resolution: The Challenge of Name MatchingTowards Identity Resolution: The Challenge of Name Matching
Towards Identity Resolution: The Challenge of Name Matching
Gil Irizarry
 
RapidMiner - Don’t Forget to Pack Text Analytics on Your Data Exploration Jou...
RapidMiner - Don’t Forget to Pack Text Analytics on Your Data Exploration Jou...RapidMiner - Don’t Forget to Pack Text Analytics on Your Data Exploration Jou...
RapidMiner - Don’t Forget to Pack Text Analytics on Your Data Exploration Jou...
Gil Irizarry
 
Beginning Native Android Apps
Beginning Native Android AppsBeginning Native Android Apps
Beginning Native Android Apps
Gil Irizarry
 
From Silos to DevOps: Our Story
From Silos to DevOps:  Our StoryFrom Silos to DevOps:  Our Story
From Silos to DevOps: Our Story
Gil Irizarry
 
Make Cross-platform Mobile Apps Quickly - SIGGRAPH 2014
Make Cross-platform Mobile Apps Quickly - SIGGRAPH 2014Make Cross-platform Mobile Apps Quickly - SIGGRAPH 2014
Make Cross-platform Mobile Apps Quickly - SIGGRAPH 2014
Gil Irizarry
 
Graphics on the Go
Graphics on the GoGraphics on the Go
Graphics on the Go
Gil Irizarry
 
Make Mobile Apps Quickly
Make Mobile Apps QuicklyMake Mobile Apps Quickly
Make Mobile Apps Quickly
Gil Irizarry
 
Building The Agile Enterprise - LSSC '12
Building The Agile Enterprise - LSSC '12Building The Agile Enterprise - LSSC '12
Building The Agile Enterprise - LSSC '12
Gil Irizarry
 
Agile The Kanban Way - Central MA PMI 2011
Agile The Kanban Way - Central MA PMI 2011Agile The Kanban Way - Central MA PMI 2011
Agile The Kanban Way - Central MA PMI 2011
Gil Irizarry
 
Transitioning to Kanban: Theory and Practice - Project Summit Boston 2011
Transitioning to Kanban: Theory and Practice - Project Summit Boston 2011Transitioning to Kanban: Theory and Practice - Project Summit Boston 2011
Transitioning to Kanban: Theory and Practice - Project Summit Boston 2011
Gil Irizarry
 
Transitioning to Kanban - Aug 11
Transitioning to Kanban - Aug 11Transitioning to Kanban - Aug 11
Transitioning to Kanban - Aug 11
Gil Irizarry
 
Transitioning to Kanban
Transitioning to KanbanTransitioning to Kanban
Transitioning to Kanban
Gil Irizarry
 
Beyond Scrum of Scrums
Beyond Scrum of ScrumsBeyond Scrum of Scrums
Beyond Scrum of Scrums
Gil Irizarry
 

More from Gil Irizarry (16)

A Rose By Any Other Name.pdf
A Rose By Any Other Name.pdfA Rose By Any Other Name.pdf
A Rose By Any Other Name.pdf
 
Ai for Good: Bad Guys, Messy Data, & NLP
Ai for Good: Bad Guys, Messy Data, & NLPAi for Good: Bad Guys, Messy Data, & NLP
Ai for Good: Bad Guys, Messy Data, & NLP
 
DevSecOps Orchestration of Text Analytics with Containers
DevSecOps Orchestration of Text Analytics with ContainersDevSecOps Orchestration of Text Analytics with Containers
DevSecOps Orchestration of Text Analytics with Containers
 
Towards Identity Resolution: The Challenge of Name Matching
Towards Identity Resolution: The Challenge of Name MatchingTowards Identity Resolution: The Challenge of Name Matching
Towards Identity Resolution: The Challenge of Name Matching
 
RapidMiner - Don’t Forget to Pack Text Analytics on Your Data Exploration Jou...
RapidMiner - Don’t Forget to Pack Text Analytics on Your Data Exploration Jou...RapidMiner - Don’t Forget to Pack Text Analytics on Your Data Exploration Jou...
RapidMiner - Don’t Forget to Pack Text Analytics on Your Data Exploration Jou...
 
Beginning Native Android Apps
Beginning Native Android AppsBeginning Native Android Apps
Beginning Native Android Apps
 
From Silos to DevOps: Our Story
From Silos to DevOps:  Our StoryFrom Silos to DevOps:  Our Story
From Silos to DevOps: Our Story
 
Make Cross-platform Mobile Apps Quickly - SIGGRAPH 2014
Make Cross-platform Mobile Apps Quickly - SIGGRAPH 2014Make Cross-platform Mobile Apps Quickly - SIGGRAPH 2014
Make Cross-platform Mobile Apps Quickly - SIGGRAPH 2014
 
Graphics on the Go
Graphics on the GoGraphics on the Go
Graphics on the Go
 
Make Mobile Apps Quickly
Make Mobile Apps QuicklyMake Mobile Apps Quickly
Make Mobile Apps Quickly
 
Building The Agile Enterprise - LSSC '12
Building The Agile Enterprise - LSSC '12Building The Agile Enterprise - LSSC '12
Building The Agile Enterprise - LSSC '12
 
Agile The Kanban Way - Central MA PMI 2011
Agile The Kanban Way - Central MA PMI 2011Agile The Kanban Way - Central MA PMI 2011
Agile The Kanban Way - Central MA PMI 2011
 
Transitioning to Kanban: Theory and Practice - Project Summit Boston 2011
Transitioning to Kanban: Theory and Practice - Project Summit Boston 2011Transitioning to Kanban: Theory and Practice - Project Summit Boston 2011
Transitioning to Kanban: Theory and Practice - Project Summit Boston 2011
 
Transitioning to Kanban - Aug 11
Transitioning to Kanban - Aug 11Transitioning to Kanban - Aug 11
Transitioning to Kanban - Aug 11
 
Transitioning to Kanban
Transitioning to KanbanTransitioning to Kanban
Transitioning to Kanban
 
Beyond Scrum of Scrums
Beyond Scrum of ScrumsBeyond Scrum of Scrums
Beyond Scrum of Scrums
 

Recently uploaded

May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
Adele Miller
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
Aftab Hussain
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
Sven Peters
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
Green Software Development
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
Boni García
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
Deuglo Infosystem Pvt Ltd
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
Neo4j
 
Webinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for EmbeddedWebinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for Embedded
ICS
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
Green Software Development
 
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
kalichargn70th171
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Crescat
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
Hornet Dynamics
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
Octavian Nadolu
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
mz5nrf0n
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
rodomar2
 
DDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systemsDDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systems
Gerardo Pardo-Castellote
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
Ayan Halder
 

Recently uploaded (20)

May Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdfMay Marketo Masterclass, London MUG May 22 2024.pdf
May Marketo Masterclass, London MUG May 22 2024.pdf
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
 
GreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-JurisicGreenCode-A-VSCode-Plugin--Dario-Jurisic
GreenCode-A-VSCode-Plugin--Dario-Jurisic
 
APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)APIs for Browser Automation (MoT Meetup 2024)
APIs for Browser Automation (MoT Meetup 2024)
 
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit ParisNeo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
Neo4j - Product Vision and Knowledge Graphs - GraphSummit Paris
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 
Webinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for EmbeddedWebinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for Embedded
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
 
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
Why Mobile App Regression Testing is Critical for Sustained Success_ A Detail...
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
 
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CDKuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
KuberTENes Birthday Bash Guadalajara - Introducción a Argo CD
 
DDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systemsDDS-Security 1.2 - What's New? Stronger security for long-running systems
DDS-Security 1.2 - What's New? Stronger security for long-running systems
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
 

[Apple-organization] and [oranges-fruit] - How to evaluate NLP tools - Basis Webinar

  • 1. How to Evaluate NLP Tools for Entity Extraction Gil Irizarry VP Engineering [Apple | Organization] and [Oranges | Fruit]:
  • 2. BASIS TECHNOLOGY About Rosette Trusted In Critical Applications By
  • 3. BASIS TECHNOLOGY About Me Gil Irizarry - VP Engineering at Basis Technology, responsible for NLP and Text Analytics software development https://www.linkedin.com/in/gilirizarry/ https://www.slideshare.net/conoagil gil@basistech.com Basis Technology - leading provider of software solutions for extracting meaningful intelligence from multilingual text and digital devices
  • 4. BASIS TECHNOLOGY Agenda ● The problem space ● Defining the domain ● Assemble a test set ● Annotation guidelines ● Review of measurement ● Evaluation examples ● Inter-annotator agreement ● The steps to evaluation
  • 6. BASIS TECHNOLOGY The Problem Space ● You have some text to analyze. Which tool to choose? ● Related question: You have multiple text or data annotators. Which are doing a good job? ● The questions are made harder by the tools outputting different formats, analyzing data differently, and annotators interpreting data differently ● Start by defining the problem space
  • 7. BASIS TECHNOLOGY The Problem Space - Example Rosette Amazon Comprehend
  • 8. BASIS TECHNOLOGY Defining the domain ● What space are you in? ● More importantly, in what domain will you evaluate tools? ● Are you: ○ Reading news ○ Scanning patents ○ Looking for financial fraud
  • 9. BASIS TECHNOLOGY Assemble a test set ● NLP systems are often trained on a general corpus. Often this corpus consists of mainstream news articles. ● Do you use this domain or a more specific one? ● If more specific, do you train a custom model?
  • 10. BASIS TECHNOLOGY Annotation Guidelines Examples requiring definition and agreement in guidelines: ● “Alice shook Brenda’s hand when she entered the meeting.” Is “Brenda” or “Brenda’s” the entity to be extracted (in addition to Alice of course)? ● Are pronouns expected to be extracted and resolved? “She” in the previous example ● What about tolerance to punctuation? The U.N. vs. the UN ● Should fictitious characters (“Harry Potter”) be tagged as “person”? ● When a location appears within an organization’s name, do you tag both the location and the organization or just the organization (“San Francisco Association of Realtors”)?
  • 11. BASIS TECHNOLOGY Annotation Guidelines Examples requiring definition and agreement in guidelines: ● Do you tag the name of a person if it is used as a modifier (“Martin Luther King Jr. Day”)? ● Do you tag “Twitter” in “You could try reaching out to the Twitterverse”? ● Do you tag “Google” in “I googled it, but I couldn’t find any relevant results”? ● When do you include “the” in an entity? The Ukraine vs. Ukraine ● How do you differentiate between an entity that’s a company name and a product by the same name? {[ORG]The New York Times} was criticized for an article about the {[LOC]Netherlands} in the June 4 edition of {[PRO]The New York Times}. ● “Washington and Moscow continued their negotiations.” Are Washington and Moscow locations or organizations?
  • 12. BASIS TECHNOLOGY Annotation Guidelines Non-entity extraction issues: ● How many levels of sentiment do you expect? ● Ontology and text classification - what categories do you expect? ● For language identification, are dialects identified as separate languages? What about macrolanguages?
  • 14. BASIS TECHNOLOGY Annotation Guidelines ● Map to Universal Dependencies Guidelines where possible: https://universaldependencies.org/guidelines.html ● Map to DBpedia ontology where possible: http://mappings.dbpedia.org/server/ontology/classes/ ● Map to known database such as Wikidata where possible: https://www.wikidata.org/wiki/Wikidata:Main_Page
  • 15. BASIS TECHNOLOGY Review of measurement: precision Precision is the fraction of retrieved documents that are relevant to the query
  • 16. BASIS TECHNOLOGY Review of measurement: recall Recall is the fraction of the relevant documents that are successfully retrieved
  • 17. BASIS TECHNOLOGY Review of measurement: F-score F-score is a harmonic mean of precision and recall Precision and recall are ratios. In this case, a harmonic mean is more appropriate for an average than an arithmetic mean.
  • 18. BASIS TECHNOLOGY Review of measurement: harmonic mean A harmonic mean returns a single value to combine both precision and recall. In the below image, a and b map to precision and recall, and H maps to F score. In this example, note that increasing a would not increase the overall score.
  • 19. BASIS TECHNOLOGY Review of measurement: F-score Previous example of F score was actually an F1 score, which balances precision and recall evenly. A more generalized form of F score is: Fβ = (1 + β²) * (P * R) / ((β² * P) + R), where P is precision and R is recall. F2 (β = 2) weights recall higher than precision and F0.5 (β = 0.5) weights precision higher than recall
  • 20. BASIS TECHNOLOGY Review of measurement: AP and MAP ● Average precision is a measure that combines recall and precision for ranked retrieval results. For one information need, the average precision is the mean of the precision scores after each relevant document is retrieved ● Mean average precision is average precision over a range of queries
  • 21. BASIS TECHNOLOGY Review of measurement: MUC score ● Message Understanding Conference (MUC) scoring allows for taking partial success into account ○ Correct: response = key ○ Partial: response ~= key ○ Incorrect: response != key ○ Spurious: key is blank and response is not ○ Missing: response is blank and key is not ○ Noncommittal: key and response are both blank ○ Recall = (correct + (partial x 0.5 )) / possible ○ Precision = (correct+(partial x 0.5)) / actual ○ Undergeneration = missing / possible ○ Overgeneration = spurious / actual
  • 22. BASIS TECHNOLOGY Evaluation Examples As co-sponsor, Tim Cook was seated at a table with Vogue editor Anna Wintour, but he made time to get around and see his other friends, including Uber CEO Travis Kalanick. Cook's date for the night was Laurene Powell Jobs, the widow of Apple cofounder Steve Jobs. Powell currently runs Emerson Collective, a company that seeks to make investments in education. Kalanick brought a date as well, Gabi Holzwarth, a well-known violinist.
  • 23. BASIS TECHNOLOGY Evaluation Examples - gold standard As co-sponsor, Tim Cook was seated at a table with Vogue editor Anna Wintour, but he made time to get around and see his other friends, including Uber CEO Travis Kalanick. Cook's date for the night was Laurene Powell Jobs, the widow of Apple cofounder Steve Jobs. Powell currently runs Emerson Collective, a company that seeks to make investments in education. Kalanick brought a date as well, Gabi Holzwarth, a well-known violinist.
  • 24. BASIS TECHNOLOGY Evaluation Examples - P, R, F ● (Green) TP = 6 ● (Olive) FP = 1 ● (Orange) TN = 3 ● (Red) FN = 3 ● Precision = 6/7 = .86 ● Recall = 6/9 = .67 ● F score = .75
  • 25. BASIS TECHNOLOGY Evaluation Examples - AP ● 1/1 (Green) ● 2/2 (Green) ● 3/3 (Green) ● 0/4 (Red) ● 4/5 (Green) ● 5/6 (Green) ● 0/7 (Red) ● 0/8 (Red) ● 6/9 (Green) ● AP = (1/1 + 2/2 + 3/3 + 4/5 + 5/6 + 6/9) / 6 = .88
  • 26. BASIS TECHNOLOGY Evaluation Examples - MUC scoring
      Token      Gold    Eval    Result
      Cook's     B-PER   B-PER   Partial
      Date       O-NONE  I-PER   Spurious
      For        O-NONE  O-NONE  Correct
      The        O-NONE  O-NONE  Correct
      Night      O-NONE  O-NONE  Correct
      Was        O-NONE  O-NONE  Correct
      Laurene    B-PER   B-PER   Correct
      Powell     I-PER   I-PER   Correct
      Jobs,      I-PER   O-NONE  Missing
      The        O-NONE  O-NONE  Correct
      widow      O-NONE  O-NONE  Correct
      Of         O-NONE  O-NONE  Correct
      Apple      O-NONE  B-ORG   Spurious
      cofounder  O-NONE  O-NONE  Correct
      Steve      B-PER   B-PER   Correct
      Jobs.      I-PER   I-LOC   Incorrect
  • 27. BASIS TECHNOLOGY Evaluation Examples - MUC scoring
      Possible = Correct + Incorrect + Partial + Missing = 11 + 1 + 1 + 1 = 14
      Actual = Correct + Incorrect + Partial + Spurious = 11 + 1 + 1 + 2 = 15
      Precision = (correct + (1/2 partial)) / actual = 11.5 / 15 = 0.77
      Recall = (correct + (1/2 partial)) / possible = 11.5 / 14 = 0.82
      Token table: same as on the previous slide.
  • 28. BASIS TECHNOLOGY Inter-annotator Agreement ● Krippendorff’s alpha is a reliability coefficient developed to measure the agreement among observers, coders, judges, raters, or measuring instruments drawing distinctions among typically unstructured phenomena ● Cohen’s kappa is a measure of the agreement between two raters who each assign a finite number of subjects to categories, with agreement due to chance factored out ● Inter-annotator agreement scoring determines the agreement between different annotators annotating the same unstructured text ● It is not intended to measure the output of a tool against a gold standard
  • 29. BASIS TECHNOLOGY The Steps to Evaluation ● Define your requirements ● Assemble a valid test dataset ● Annotate the gold standard test dataset ● Get output from tools ● Evaluate the results ● Make your decision
  • 30. BASIS TECHNOLOGY Thank You! Gil Irizarry VP Engineering gil@basistech.com https://www.linkedin.com/in/gilirizarry/ https://www.slideshare.net/conoagil
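
The precision, recall, F-score and average-precision calculations from the measurement and evaluation slides above can be sketched in a few lines. This is a minimal illustration, assuming Python (the slides name no language); the function names are hypothetical and are not part of Rosette or any other tool mentioned in the talk. The inputs are the counts and the ranked green/red results shown on the P, R, F and AP slides.

```python
from typing import Sequence, Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision more heavily.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def average_precision(hits: Sequence[bool]) -> float:
    """Mean of the precision values taken at each rank that holds a relevant result."""
    found = 0
    total = 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            found += 1
            total += found / rank  # precision at this rank
    return total / found if found else 0.0


# Counts from the P, R, F slide: TP = 6, FP = 1, FN = 3.
p, r = precision_recall(tp=6, fp=1, fn=3)
f1 = f_beta(p, r)
f2 = f_beta(p, r, beta=2.0)

# Ranked results from the AP slide (True = green, False = red).
ap = average_precision([True, True, True, False, True, True, False, False, True])

print(f"P={p:.2f} R={r:.2f} F1={f1:.2f} F2={f2:.2f} AP={ap:.2f}")
# prints: P=0.86 R=0.67 F1=0.75 F2=0.70 AP=0.88
```

Because recall is the weaker of the two values here, F2 comes out lower than F1; a tool with the opposite profile would see F2 rise instead.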

Editor's Notes

  1. Thank you for joining. While we wait for people to join, I'm going to spend two minutes telling you about Rosette. Rosette is our text analytics brand. We pride ourselves on providing a high-quality, carefully curated and TESTED set of text analytics and natural language processing capabilities. Testing and evaluation of NLP has become one of our in-house specialties and a service we provide to customers. This is what inspired Gil's talk today. There is also a "How To Evaluate NLP" series on our blog if you want to read more after this talk. We also pride ourselves on comprehensive NLP coverage, in both breadth of capabilities AND language support. Rosette text analytics enables high-quality analytics in over 32 languages. All the Rosette capabilities are highly adaptable, with easy tools for domain adaptation and many options for deployment. We work with our clients to engineer the best possible NLP solution for their needs, using every possible data source to make their AI smart and resilient. Major brands that you know deploy Rosette on-premises and in the cloud for their mission-critical, high-volume systems. Now let me introduce your host for this talk, Basis Technology's VP of Engineering, Gil Irizarry...
  2. Rosette is a full NLP stack from language identification to morphology to entity extraction and resolution. We are moving into application development with annotation studio and identity resolution.
  3. One tool will output 5 levels of sentiment and another only 3. One tool will output transitive vs. intransitive verbs and another will output only verbs. One will strip possessives (King’s Landing) and another won’t.
  4. Rosette / Amazon Comprehend. Note that Rosette and Comprehend identify titles differently. Comprehend identified CEO as a person and didn’t identify the pronoun.
  5. Finding data is easier but annotating data is hard
  6. The Ukraine is now Ukraine, similarly Sudan. How do you handle the change over time?
  7. Screenshot of the TOC of our Annotation Guidelines. 42 pages. In some meetings, it’s the only doc under NDA. Header says for all. That means for all languages. We also have specific guidelines for some languages.
  8. Images from wikipedia
  9. Images from wikipedia
  10. A harmonic mean is a better balance of two values than a simple average
  11. Increasing A would lower the overall score, since both G and H would get smaller
  12. Changing the beta value allows you to tune the harmonic mean and weight either precision or recall more heavily
  13. https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-39940-9_482 Precision is a single value. Average precision takes into account precision over a range of results. Mean average precision is the mean over a range of queries.
  14. http://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/ https://pdfs.semanticscholar.org/f898/e821bbf4157d857dc512a85f49610638f1aa.pdf
  15. Annotated sample of people names. Note “Cook’s” and “Powell” as references to earlier names. Note the “Emerson Collective” as an organization name is not highlighted.
  16. Precision = TP / (TP + FP), Recall = TP / (TP + FN) , F = 2*((P * R)/(P + R))
  17. AP = (sum of (True Positive / Predicted Positive)) / num of True Positive. MAP is the mean of AP over a range of different queries, for example varying the tolerances or confidences.
  18. https://pdfs.semanticscholar.org/f898/e821bbf4157d857dc512a85f49610638f1aa.pdf http://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/
  19. Possible: the number of entities hand-annotated in the gold evaluation corpus, equal to (Correct + Incorrect + Partial + Missing). Actual: the number of entities tagged by the test NER system, equal to (Correct + Incorrect + Partial + Spurious). (R) Recall = (correct + (1/2 partial)) / possible. (P) Precision = (correct + (1/2 partial)) / actual. F = (2 * P * R) / (P + R).
  20. http://www.real-statistics.com/reliability/interrater-reliability/cohens-kappa/
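
As a rough illustration of notes 19 and 20, the sketch below scores the token table from the MUC scoring slides and computes Cohen's kappa for two annotators. It assumes Python, and the function names are hypothetical. The MUC result labels are taken as given in the slide's table; the two annotator tag sequences are invented purely for illustration and do not come from the talk.

```python
from collections import Counter
from typing import Dict, Sequence


def muc_scores(results: Sequence[str]) -> Dict[str, float]:
    """Score a list of MUC result labels (correct, partial, incorrect, spurious, missing)."""
    counts = Counter(label.lower() for label in results)
    shared = counts["correct"] + counts["incorrect"] + counts["partial"]
    possible = shared + counts["missing"]   # what the gold standard contains
    actual = shared + counts["spurious"]    # what the tool produced
    credit = counts["correct"] + 0.5 * counts["partial"]
    precision = credit / actual if actual else 0.0
    recall = credit / possible if possible else 0.0
    f1 = 2 * precision * recall / (precision + recall) if credit else 0.0
    return {"possible": possible, "actual": actual,
            "precision": precision, "recall": recall, "f1": f1}


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Agreement between two raters with chance agreement factored out."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Result column of the slide's token table, top to bottom.
results = (["partial", "spurious"] + ["correct"] * 6 + ["missing"]
           + ["correct"] * 3 + ["spurious"] + ["correct"] * 2 + ["incorrect"])
scores = muc_scores(results)
print({name: round(value, 2) for name, value in scores.items()})

# Hypothetical tags from two human annotators on the same six tokens.
annotator_a = ["PER", "PER", "O", "ORG", "O", "O"]
annotator_b = ["PER", "O", "O", "ORG", "O", "LOC"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 2))
```

With 11 correct and 1 partial, the credited count is 11.5, giving precision 11.5 / 15 ≈ 0.77 and recall 11.5 / 14 ≈ 0.82. In the invented annotator example, observed agreement is 4/6 and chance agreement is 1/3, so kappa is 0.5.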