This document discusses evaluation measures used in information retrieval and natural language processing. It describes precision, recall, and the F1 score as fundamental measures for unranked retrieval sets, and also covers averaged precision and recall, accuracy, and the novelty and coverage ratios. For ranked retrieval sets, it discusses recall-precision graphs, interpolated recall-precision, precision at k, R-precision, ROC curves, and normalized discounted cumulative gain (NDCG). The document also discusses agreement measures such as the Kappa statistic and parse evaluation measures such as Parseval and attachment scores.
7. Ways to interpret precision
A measure of the ability of a system to present only relevant items
The fraction of correct instances among all instances that the algorithm believes to belong to the relevant set
It is a measure of exactness or fidelity
It tells how well a system weeds out what you don't want
Says nothing about the number of false negatives
8. Ways to interpret recall
A measure of the ability of a system to present all relevant items
The fraction of correct instances among all instances that actually belong to the relevant set
It is a measure of completeness
It tells how well a system performs at getting what you want
Says nothing about the number of false positives
9. Precision or recall?
Typical web surfers would like every result of the search engine on the first page to be relevant (high precision)
Do they care whether the search engine returns all the relevant documents (high recall)?
Individuals searching their hard disks are often interested in high-recall searches
10. F-Score
A single measure that trades off precision versus recall is the F measure, the weighted harmonic mean of precision and recall:
F = 1 / (α(1/P) + (1 − α)(1/R)) = (β² + 1)PR / (β²P + R), where β² = (1 − α)/α
11. F-Score
The default balanced F measure equally weights precision and recall, which means setting α = 1/2 or β = 1
The equation of the F-score then becomes
F1 = 2PR / (P + R)
12. F-Score
However, an even weighting is not the only choice
Values of β < 1 emphasize precision, while values of β > 1 emphasize recall
13. F-Score
Say P = 16.20 and R = 12.63
If β = 3, F-score = 12.91 (closer to recall)
If β = 0.3, F-score = 15.83 (closer to precision)
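The worked example above can be sketched in a few lines of code; `f_beta` is a hypothetical helper (not from the slides) implementing the weighted harmonic mean, evaluated on the slide's own P and R:

```python
def f_beta(p, r, beta):
    """Weighted harmonic mean of precision and recall."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

P, R = 16.20, 12.63
print(round(f_beta(P, R, 3), 2))    # 12.91, closer to recall
print(round(f_beta(P, R, 0.3), 2))  # 15.83, closer to precision
print(round(f_beta(P, R, 1), 2))    # 14.19, the balanced F1
```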
14. Why Harmonic Mean?
Reason 1
Say a search engine returns all the documents, achieving a high recall of 100%
But when you use it, it gives you 1 relevant document among 10,000 retrieved documents (a low precision of 0.01%)
If you take the arithmetic mean, you get an F-score of about 50%
If you take the harmonic mean, you get an F-score of about 0.02%
15. Why Harmonic Mean?
Reason 2
The harmonic mean is always less than or equal to the arithmetic mean and the geometric mean
When the values of two numbers differ greatly, the harmonic mean is closer to their minimum than to their arithmetic mean
16. Why Harmonic Mean?
Reason 3
Precision and recall are ratios
When you average ratios, the harmonic mean is the most suitable measure
17. Average precision and recall
Say, on n datasets, your system achieves precisions p1, p2, …, pn and recalls r1, r2, …, rn
What are the average precision and recall of your system?
Macro-averaging method: compute precision/recall on each dataset first, then average these statistics over all datasets
Micro-averaging method: add up the true positives, false positives, and false negatives across all datasets first, then use these pooled counts to compute the statistics
18. Average precision and recall
Say your system has the following performance on two datasets:
tp1 = 10, fp1 = 5, fn1 = 3, so p1 = 66.67, r1 = 76.92
tp2 = 20, fp2 = 4, fn2 = 5, so p2 = 83.33, r2 = 80.00
Macro p = (66.67 + 83.33)/2 = 75
Macro r = (76.92 + 80.00)/2 = 78.46
Micro p = (10 + 20)/[(10 + 20) + (5 + 4)] = 76.92
Micro r = (10 + 20)/[(10 + 20) + (3 + 5)] = 78.95
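The two averaging schemes above can be sketched directly from the slide's counts; `prf` is a hypothetical helper name, not from the slides:

```python
def prf(tp, fp, fn):
    """Precision and recall from raw counts."""
    return tp / (tp + fp), tp / (tp + fn)

counts = [(10, 5, 3), (20, 4, 5)]            # (tp, fp, fn) per dataset

# Macro: compute the statistics per dataset, then average them.
pr = [prf(*c) for c in counts]
macro_p = sum(p for p, _ in pr) / len(pr)    # 0.75
macro_r = sum(r for _, r in pr) / len(pr)    # ~0.7846

# Micro: pool the raw counts first, then compute once.
TP = sum(c[0] for c in counts)
FP = sum(c[1] for c in counts)
FN = sum(c[2] for c in counts)
micro_p, micro_r = prf(TP, FP, FN)           # ~0.7692, ~0.7895
```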
19. Average precision and recall
The micro-averaging method favors large categories with many instances
The macro-averaging method shows how the classifier performs across all categories
20. Accuracy
An obvious alternative that may occur to the reader is to judge an information retrieval system by its accuracy
It is the fraction of its classifications that are correct
21. Accuracy
There is a good reason why accuracy is not an appropriate measure for information retrieval problems
In almost all circumstances, the data is extremely skewed: normally over 99.9% of the documents are in the nonrelevant category
A system tuned to maximize accuracy can appear to perform well by simply deeming all documents nonrelevant to all queries
Even if the system is quite good, trying to label some documents as relevant will almost always lead to a high rate of false positives
23. Measures and equivalent terms
Measure                         | Expression                    | Equivalent terms
True positive                   |                               | Hit
True negative                   |                               | Correct rejection
False positive                  |                               | Type I error, False alarm
False negative                  |                               | Type II error, Miss
Recall                          | tp/(tp + fn)                  | Sensitivity, True positive rate, Hit rate
Precision                       | tp/(tp + fp)                  | Positive predictive value (PPV)
False positive rate             | fp/(fp + tn)                  | False alarm rate, Fall-out
Accuracy                        | (tp + tn)/(tp + tn + fp + fn) |
Specificity                     | tn/(fp + tn)                  | True negative rate
Negative predictive value (NPV) | tn/(tn + fn)                  |
False discovery rate            | fp/(fp + tp)                  |
24. Some other measures
Novelty ratio
The proportion of items retrieved and judged relevant by the user of which they were previously unaware
Measures the ability to find new information on a topic
Coverage ratio
The proportion of relevant items retrieved out of the total relevant documents known to the user prior to the search
26. Introduction
Precision, recall, and the F measure are set-based measures
They are computed using unordered sets of documents
We need to extend these measures to evaluate the ranked retrieval results that are standard with search engines
29. Interpolated precision-recall
The interpolated precision at a recall level r is the maximum precision found at any recall level equal to or greater than r
Worked examples (tables not reproduced in this transcript): the maximum precision for a recall equal to or greater than the given level is 1 in the first table and 4/6 in the second
37. Precision at k
This leads to measuring precision at a fixed low level of retrieved results
Such as ten documents (precision at 10) or thirty documents (precision at 30)
Useful when you don't know the total number of relevant documents
It is the least stable of the commonly used measures
Does not average well
38. Precision at k
Let the total number of relevant docs be 6 among the 14 extracted docs (marked x):

n  | doc # | relevant
1  | 588   | x
2  | 589   | x
3  | 576   |
4  | 590   | x
5  | 986   |
6  | 592   | x
7  | 984   |
8  | 988   |
9  | 578   |
10 | 985   |
11 | 103   |
12 | 591   |
13 | 772   | x
14 | 990   | x

P at k = 1: 1/1 = 1
P at k = 2: 2/2 = 1
P at k = 4: 3/4 = 0.75
P at k = 6: 4/6 = 0.667, so precision at k = 6 is 66.7%
But it will drop if you measure precision at k = 7
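The ranked list above can be checked with a minimal sketch; the 0/1 relevance list transcribes the x marks in the table, and `precision_at_k` is a hypothetical helper name:

```python
# Relevance of the 14 extracted docs, ranks 1..14 (1 = relevant, 0 = not).
rel = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1]

def precision_at_k(relevance, k):
    """Fraction of the top k results that are relevant."""
    return sum(relevance[:k]) / k

print(precision_at_k(rel, 1))            # 1.0
print(precision_at_k(rel, 4))            # 0.75
print(round(precision_at_k(rel, 6), 3))  # 0.667
print(round(precision_at_k(rel, 7), 3))  # 0.571, the drop at k = 7
```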
41. ROC curve
Stands for Receiver Operating Characteristic
Plots the true positive rate (sensitivity/recall) against the false positive rate (1 − specificity)
42. ROC curve
Specificity
A sniffer dog looking for drugs would have a low specificity if it is often led astray by things that aren't drugs (cosmetics or food, for example)
Specificity can be considered the percentage of times a test will correctly identify a negative result
Also called the true negative rate
False positive rate
1 − specificity = 1 − tn/(fp + tn) = fp/(fp + tn)
43. ROC curve
The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test
The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test
45. Area under the ROC curve
Many tools can give you the area under the ROC curve (AUC)
If the ability of your system is hard to judge from the ROC curve alone, you can use the AUC instead
0.90-1.00 = excellent
0.80-0.90 = good
0.70-0.80 = fair
0.60-0.70 = poor
0.50-0.60 = fail
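The AUC can be sketched without any plotting, using the standard rank interpretation: it equals the probability that a randomly chosen positive is scored above a randomly chosen negative (ties counted as 0.5). The function name and the toy data below are illustrative, not from the slides:

```python
def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # 1.0, a perfect ranking
print(auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))  # 0.75, one pair misordered
```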
46. Cumulative gain
Say you have extracted 6 documents
The relevance of each document is judged on a scale of 0-3, with 0 meaning irrelevant, 3 meaning completely relevant, and 1 and 2 meaning "somewhere in between"
Let the order of your extraction be D1, D2, D3, D4, D5, D6
and your scores on them be 3, 2, 3, 0, 1, 2
The cumulative gain of this search result listing is then
CG6 = 3 + 2 + 3 + 0 + 1 + 2 = 11
48. Normalized DCG (NDCG)
The performance of this query is not comparable to that of another query,
since the other query may have more results, resulting in a larger overall DCG that is not necessarily better
To compare queries, the DCG values must be normalized
49. NDCG
To normalize DCG values, an ideal ordering for the given query is needed
The ideal ordering places the documents in decreasing order of their relevance scores:
3, 3, 2, 2, 1, 0
The DCG of this ideal ordering, or IDCG, is then IDCG6 = 8.693
The nDCG for this query is given as nDCG6 = DCG6/IDCG6 = 8.10/8.693 ≈ 0.932
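The nDCG computation above can be sketched as follows. The discount rel_1 + Σ_{i≥2} rel_i/log2(i) is assumed because it reproduces the slide's IDCG6 = 8.693 for the ideal ordering 3, 3, 2, 2, 1, 0; `dcg` is a hypothetical helper name:

```python
import math

def dcg(rels):
    """DCG with the classic log2 rank discount (no discount at rank 1)."""
    return rels[0] + sum(r / math.log2(i)
                         for i, r in enumerate(rels[1:], start=2))

rels = [3, 2, 3, 0, 1, 2]                  # scores in extraction order
ideal = sorted(rels, reverse=True)         # 3, 3, 2, 2, 1, 0
print(round(dcg(rels), 3))                 # 8.097
print(round(dcg(ideal), 3))                # 8.693, the IDCG6 from the slide
print(round(dcg(rels) / dcg(ideal), 3))    # 0.932, the nDCG
```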
51. Kappa measure
Suppose that you were analyzing data related to people applying for a grant
Each grant proposal was read by two people, and each reader either said "Yes" or "No" to the proposal
Suppose the data were as follows, where rows are reader A and columns are reader B:

         B: Yes | B: No
A: Yes |   20   |   5
A: No  |   10   |  15
52. Kappa measure
Note that there were 20 proposals that were granted by both reader A and reader B, and 15 proposals that were rejected by both readers
Thus the observed percentage agreement is Pr(a) = (20 + 15)/50 = 0.70
53. Kappa measure
To calculate Pr(e), the probability of random agreement, we note that:
Reader A said "Yes" to 25 applicants and "No" to 25 applicants, so reader A said "Yes" 50% of the time
Reader B said "Yes" to 30 applicants and "No" to 20 applicants, so reader B said "Yes" 60% of the time
54. Kappa measure
Therefore the probability that both of them would say "Yes" at random is 0.50 × 0.60 = 0.30, and
the probability that both of them would say "No" is 0.50 × 0.40 = 0.20
Thus the overall probability of random agreement is Pr(e) = 0.3 + 0.2 = 0.5
The kappa statistic is then κ = (Pr(a) − Pr(e))/(1 − Pr(e)) = (0.70 − 0.50)/(1 − 0.50) = 0.40
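The grant-review example can be worked end to end from the 2×2 counts; the variable names below are illustrative:

```python
# Counts from the contingency table (rows = reader A, columns = reader B).
yy, yn, ny, nn = 20, 5, 10, 15   # both-yes, A-yes/B-no, A-no/B-yes, both-no
n = yy + yn + ny + nn            # 50 proposals in total

p_a = (yy + nn) / n                                  # observed agreement, 0.70
p_e = ((yy + yn) / n) * ((yy + ny) / n) \
    + ((ny + nn) / n) * ((yn + nn) / n)              # chance agreement, 0.50
kappa = (p_a - p_e) / (1 - p_e)
print(round(kappa, 2))  # 0.4
```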
56. Inconsistencies with the Kappa measure
In the following two cases (contingency tables not reproduced in this transcript) there is equal agreement between A and B (60 out of 100 in both cases), so we would expect the relative values of Cohen's Kappa to reflect this
57. Interpretation of Kappa measures
Kappa is always less than or equal to 1
A value of 1 implies perfect agreement; values less than 1 imply less than perfect agreement
In rare situations, Kappa can be negative
This is a sign that the two observers agreed less than would be expected just by chance
Possible interpretations of Kappa (Altman DG. Practical Statistics for Medical Research. (1991) London, England: Chapman and Hall):
Poor agreement = less than 0.20
Fair agreement = 0.20 to 0.40
Moderate agreement = 0.40 to 0.60
Good agreement = 0.60 to 0.80
Very good agreement = 0.80 to 1.00
58. Other agreement measures
A (or M) and B (or N) are the two sets of extracted terms
C is the number of terms common between the two sets
60. Common parse tree evaluation measures
Tree accuracy, or exact match:
1 point if the parse tree is completely right (against the gold standard), 0 otherwise
The strictest criterion
For many potential tasks, partly right parses are not much use:
things will not work very well in a database query system if one gets the scope of operators wrong, and it does not help much that the system got part of the parse tree right
64. Parseval
Charniak shows that, according to these measures, one can do surprisingly well on parsing the Penn Treebank by inducing a vanilla PCFG that ignores all lexical content
Success on crossing brackets is helped by the fact that Penn trees are quite flat
To the extent that sentences have very few brackets in them, the number of crossing brackets is likely to be small
65. Parseval
If there is a constituent that attaches very high (in a complex right-branching sentence), but the parser by mistake attaches it very low, then every node in the right-branching complex will be wrong, seriously damaging both precision and recall, whereas arguably only a single mistake was made by the parser
68. Types of evaluation
Exact match
The percentage of completely correctly parsed sentences
The same measure is also used for the evaluation of constituent parsers
Attachment score
The percentage of words that have the correct head
69. Attachment Score
The output of the gold standard is called the key
The output of the candidate parser is called the answer
The attachment score is the percentage of words whose head is correctly identified in the answer
70. Attachment Score
True positives: dependencies present in both outputs
False positives: dependencies present in the answer but absent in the key
False negatives: dependencies present in the key but absent in the answer
(Figure not reproduced: gold standard (key) output versus candidate (answer) output)
71. Attachment Score
Then we calculate precision, recall, and F-score
When both the answer and the key are full parses, each of them has N − 1 dependencies, where N is the number of words in the sentence
In that case the precision and recall values will be the same
If a full parse is reported, the ratio between the number of correct dependencies and the number of words is adopted as the evaluation metric
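The head-matching idea above can be sketched as follows. Heads are represented as one integer per word (0 = root), a common dependency-parsing convention; the function name and the toy sentence are illustrative, not from the slides:

```python
def attachment_score(gold_heads, pred_heads):
    """Fraction of words whose predicted head matches the gold head."""
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)

# Toy 5-word sentence: the parser misattaches one word's head.
gold = [2, 0, 2, 5, 3]
pred = [2, 0, 2, 3, 3]
print(attachment_score(gold, pred))  # 0.8
```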
72. Types of attachment score
Strict evaluation
The dependency relation, head, and dependent must all match
Useful when both parsers use the same set of dependency relations
Relaxed evaluation
The head and dependent must match, but matching the dependency relation is optional
Some evaluations report only the match of the head in a dependency
Useful when the parsers use different sets of dependency relations
73. References
Enormous resources have been collected from Mr. Google, son of Mrs. Web
Manning, Raghavan, and Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
Manning and Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.