This document discusses various evaluation metrics for binary and multi-class classification models. It explains that metrics are important for quantifying model performance, tracking progress, and debugging. For binary classifiers, it describes point metrics like accuracy, precision, recall from the confusion matrix. Summary metrics like AUROC and AUPRC are discussed as ways to evaluate models across all thresholds. The tradeoff between precision and recall is illustrated. Class imbalance issues and choosing appropriate metrics are also covered.
3. Why are metrics important?
- The training objective (cost function) is only a proxy for the real-world objective.
- Metrics help capture a business goal as a quantitative target (not all errors are equal).
- They help organize ML team effort towards that target.
  - Generally in the form of improving that metric on the dev set.
- Useful to quantify the “gap” between:
  - Desired performance and baseline (to estimate effort initially).
  - Desired performance and current performance.
- Measure progress over time (No Free Lunch Theorem).
- Useful for lower-level tasks and debugging (like diagnosing bias vs. variance).
- Ideally the training objective would be the metric itself, but that is not always possible. Still, metrics are useful and important for evaluation.
4. Binary Classification
● X is the input
● Y is the binary output (0/1)
● Model is ŷ = h(X)
● Two types of models
○ Models that output a categorical class directly (k-nearest neighbors, decision trees)
○ Models that output a real-valued score (SVM, logistic regression)
■ The score could be a margin (SVM) or a probability (LR, NN)
■ Need to pick a threshold
■ We focus on this type (the other type can be interpreted as an instance of it)
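A minimal sketch of the score-then-threshold pattern (synthetic data via scikit-learn; the 0.5 threshold is an arbitrary choice made for illustration):

```python
# A score-producing model only becomes a classifier once we pick a threshold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

scores = model.predict_proba(X)[:, 1]      # real-valued score in [0, 1]
threshold = 0.5                            # the choice this section is about
y_hat = (scores >= threshold).astype(int)  # categorical prediction
```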
5. Score-based models
[Figure: examples placed along a scoring line from Score = 0 to Score = 1; green = positive-labelled examples, gray = negative-labelled examples.]
Prevalence = #positives / (#positives + #negatives)
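The prevalence formula is a one-liner; a toy example with made-up labels:

```python
import numpy as np

y = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0])  # toy label vector
prevalence = y.sum() / len(y)                  # #positives / (#positives + #negatives)
print(prevalence)  # 0.2
```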
23. Summary metrics: Log-Loss motivation
[Figure: Model A and Model B scoring the same data set, each shown on a line from Score = 0 to Score = 1.]
Two models scoring the same data set. Is one of them better than the other?
Model A Model B
24. Summary metrics: Log-Loss
● These two model outputs have the same ranking, and therefore the same AU-ROC, AU-PRC, and accuracy!
● Gain = the probability the model assigns to the correct label; log-loss = −log(gain).
● Log loss rewards confident correct answers and heavily penalizes confident wrong answers.
● exp(−E[log-loss]) is the geometric mean of gains, in [0, 1].
● One perfectly confident wrong prediction is fatal.
● Gaining popularity as an evaluation metric (Kaggle).
[Figure: the two models' scores shown again on lines from Score = 0 to Score = 1.]
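The slide's claim can be checked numerically: two hypothetical models with identical rankings share the same AUROC, while log-loss separates the timid model from the confident one (toy labels and scores assumed):

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

y_true = np.array([0, 0, 1, 1])

# Same ranking of the four examples, different confidence.
scores_a = np.array([0.4, 0.45, 0.55, 0.6])   # timid model
scores_b = np.array([0.05, 0.1, 0.9, 0.95])   # confident (and correct) model

print(roc_auc_score(y_true, scores_a), roc_auc_score(y_true, scores_b))  # 1.0 1.0
print(log_loss(y_true, scores_a), log_loss(y_true, scores_b))            # A pays more
```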
26. Unsupervised Learning
● log P(x) is a measure of fit in probabilistic models (GMM, factor analysis)
○ High log P(x) on the training set but low log P(x) on the test set is a sign of overfitting
○ The raw value of log P(x) is hard to interpret in isolation
● K-means is trickier (because of its fixed-covariance assumption)
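A sketch of the train-vs-test log P(x) check with scikit-learn's GaussianMixture (synthetic data; the component count is an arbitrary choice):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))
X_test = rng.normal(size=(200, 2))

gmm = GaussianMixture(n_components=3, random_state=0).fit(X_train)

# score() returns the mean log-likelihood per sample; a large train/test gap
# is the overfitting symptom described above.
train_ll = gmm.score(X_train)
test_ll = gmm.score(X_test)
print(train_ll, test_ll)
```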
27. Class Imbalance: Problems
Symptom: prevalence < 5% (no strict definition).
Metrics: may not be meaningful.
Learning: may not focus on minority-class examples at all (the majority class can overwhelm logistic regression, and to a lesser extent SVM).
28. Class Imbalance: Metrics (pathological cases)
Accuracy: blindly predict the majority class.
Log-Loss: the majority class can dominate the loss.
AUROC: easy to keep AUC high by scoring most negatives very low.
AUPRC: somewhat more robust than AUROC, but it has other challenges.
- What kind of interpolation? AUCNPR?
In general (in robustness to imbalance): Accuracy << AUROC << AUPRC.
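The accuracy pathology is easy to demonstrate with a made-up 2%-prevalence label vector:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 2% prevalence: blindly predicting the majority class looks "accurate".
y_true = np.array([1] * 2 + [0] * 98)
y_majority = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_majority)
print(acc)  # 0.98, despite recall = 0 on the minority class
```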
29. Multi-class (a few remarks)
● The confusion matrix will be N×N (we still want heavy diagonals, light off-diagonals)
● Most metrics (except accuracy) are generally analysed as multiple one-vs-rest problems
● There are multiclass variants of AUROC and AUPRC (micro vs. macro averaging)
● Class imbalance is common (in both the absolute and the relative sense)
● Cost-sensitive learning techniques (also help with binary imbalance)
○ Assign a $$ value to each block of the confusion matrix, and incorporate those into the loss function.
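One possible sketch of the cost-matrix idea (the $ values here are hypothetical): weight each confusion-matrix cell by its cost and sum, so misclassifications are no longer all equal.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted

# Hypothetical $ cost for each (true, predicted) cell; the diagonal costs nothing.
cost = np.array([[0,  1, 1],
                 [5,  0, 1],
                 [10, 2, 0]])
total_cost = (cm * cost).sum()
print(total_cost)
```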
30. Choosing Metrics
Some common patterns:
- High precision is a hard constraint; optimize recall (e.g., search engine results, grammar correction) -- intolerant of FP. Metric: Recall at Precision = XX%.
- High recall is a hard constraint; optimize precision (e.g., medical diagnosis) -- intolerant of FN. Metric: Precision at Recall = 100%.
- Capacity constrained (by K). Metric: Precision in top-K.
- Etc.
- Choose the operating threshold based on the above criteria.
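One way to compute "Recall at Precision = XX%" from the precision-recall curve (toy labels and scores; 90% is used as the example precision floor):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])

prec, rec, thresholds = precision_recall_curve(y_true, scores)

target_precision = 0.9
feasible = rec[prec >= target_precision]          # recalls achievable at >= 90% precision
best_recall = feasible.max() if feasible.size else 0.0
print(best_recall)
```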
Dots are examples: Green = positive label, gray = negative label.
Consider score to be probability output of Logistic Regression.
Position of example on scoring line based on model output. For most of the metrics, only the relative rank / ordering of positive/negative matter.
Prevalence decides class imbalance.
If too many examples, plot class-wise histogram instead.
Rank view useful to start error analysis (start with the topmost negative example, or bottom-most positive example).
Only after picking a threshold do we have a classifier.
Once we have a classifier, we can obtain all the Point metrics of that classifier.
All point metrics can be derived from the confusion matrix. Confusion matrix captures all the information about a classifier performance, but is not a scalar!
Properties:
Total Sum is Fixed (population).
Column Sums are Fixed (class-wise population).
Quality of model & selected threshold decide how columns are split into rows.
We want diagonals to be “heavy”, off diagonals to be “light”.
TP = Green examples correctly identified as green.
TN = Gray examples correctly identified as gray. “Trueness” is along the diagonal.
FP = Gray examples falsely identified as green.
FN = Green examples falsely identified as gray.
Accuracy = what fraction of all predictions did we get right?
Improving accuracy is equivalent to having a 0-1 Loss function.
Precision = quality of positive predictions.
Also called PPV (Positive Predictive Value).
Recall (sensitivity) - What fraction of all greens did we pick out? Terminology from lab tests: how sensitive is the test in detecting disease?
Somewhat related to Precision ( both recall and precision involve TP)
Trivial 100% recall = pull everybody above the threshold.
Trivial 100% precision = push everybody below the threshold except 1 green on top (hopefully no gray above it!).
Striving for good precision with 100% recall = pulling up the lowest green as high as possible in the ranking.
Striving for good recall with 100% precision = pushing down the top most gray as low as possible in the ranking.
Negative Recall (Specificity) - What fraction of all negatives in the population were we able to keep away?
Note: Sensitivity and specificity operate in different sub-universes.
Summarize precision and recall into single score
F1-score: Harmonic mean of PR and REC
G-score: Geometric mean of PR and REC
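All of the point metrics above follow from the four confusion-matrix cells; a sketch with assumed counts:

```python
# Point metrics from assumed confusion-matrix counts.
tp, fp, fn, tn = 40, 10, 20, 130

accuracy    = (tp + tn) / (tp + fp + fn + tn)
precision   = tp / (tp + fp)          # PPV: quality of positive predictions
recall      = tp / (tp + fn)          # sensitivity
specificity = tn / (tn + fp)          # negative recall

f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
g  = (precision * recall) ** 0.5                    # geometric mean
print(accuracy, precision, recall, specificity, f1, g)
```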
Changing the threshold can result in a new confusion matrix, and new values for some of the metrics.
Many threshold values are redundant (between two consecutively ranked examples). Number of effective thresholds = M + 1 (# of examples + 1)
As we scan through all possible effective thresholds, we explore all the possible values the metrics can take on for the given model.
Each row is specific to the threshold.
Table is specific to the model (different model = different effective thresholds).
Observations: Sensitivity monotonic, Specificity monotonic in opposite direction. Precision somewhat correlated with specificity, and not monotonic. Tradeoff between Sensitivity and Specificity, Precision and Recall.
We may not want to pick a threshold up front, but still want to summarize the overall model performance.
AUC = Area Under Curve. Also called the C-statistic (concordance score). Represents how well the results are ranked: if you pick a positive example at random and a negative example at random, AUC is the probability that the positive example is ranked above the negative one. A measure of the quality of discrimination.
Some people plot TPR (= Sensitivity) vs FPR (= 1 - Specificity).
Thresholds are points on this curve. Each point represents one set of point metrics.
Diagonal line = random guessing.
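A small worked example of the ranking interpretation (toy data): with 1 of the 9 positive-negative pairs mis-ordered, AUROC comes out to 8/9.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one point per effective threshold
auc = roc_auc_score(y_true, scores)
print(auc)  # 8/9: one of nine positive/negative pairs is mis-ranked
```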
Grid view (top left to bottom right, postive = move right, negative = move down)
Precision Recall curve represents a different tradeoff. Think search engine results ranked.
End of curve at right cannot be lower than prevalence.
Area under PRC = Average Precision. Intuition: By randomly picking the threshold, what’s the expected precision?
While specificity monotonically decreases with increasing recall (the number of correctly classified negative examples can only go down), precision can be “jagged”. A run of positives in the ranked sequence increases both precision and recall (so the curve climbs slowly), but a run of negatives drops precision without increasing recall. Hence, by definition, the curve has only slow climbs and vertical drops.
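Average precision as computed by scikit-learn, on the same kind of toy data:

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.8, 0.9])

# Area under the PR curve, computed as a recall-weighted sum of precisions.
ap = average_precision_score(y_true, scores)
print(ap)
```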
Somehow the second model feels “better”.
The metrics so far fail to capture this.
Accuracy optimizes for: the model gets $1 for a correct answer, $0 for a wrong one.
Log-loss: the model makes a bet of $0.xy that its prediction is correct. If correct, it earns $0.xy; otherwise it earns $(1 − 0.xy). This incentive structure is well suited to producing calibrated models. The take-home is the geometric mean of the earned amounts.
The low bar for exp(−E[log-loss]) is 0.5.
Log-loss on the validation set measures how much this “betting engine” earned.
View the calibration plot and histogram together.
When the model is well calibrated, the histogram indicates the level of discrimination.
If the model is not calibrated, the histogram may not mean much at all (see the SVC plot).
The Brier score is another measure of calibration. Both the Brier score and log-loss are proper scoring rules.
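A sketch of both calibration measures with scikit-learn (toy probabilities assumed; bin count chosen only to fit the tiny sample):

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 1, 1, 1])
probs = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.5, 0.7, 0.8, 0.9, 0.95])

brier = brier_score_loss(y_true, probs)  # mean squared error of the probabilities

# Calibration plot data: fraction of positives vs. mean predicted probability per bin.
frac_pos, mean_pred = calibration_curve(y_true, probs, n_bins=2)
print(brier, frac_pos, mean_pred)
```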
For accuracy, the prevalence of the majority class should be the low bar!
AUROC pathology at 10% prevalence: the top-scoring 10% of examples are all negatives, the next block holds all the positives, followed by the rest of the negatives. AUROC ≈ 0.9 even though the ranking is practically useless.
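The pathological ranking described above can be simulated directly (sizes assumed: 1000 examples at 10% prevalence):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

n = 1000
y = np.zeros(n, dtype=int)
scores = np.ones(n)      # bottom block: the remaining 800 negatives
scores[:100] = 3.0       # top block: 100 negatives scored above everything
y[100:200] = 1           # next block: all 100 positives
scores[100:200] = 2.0

auc = roc_auc_score(y, scores)           # ~0.889 despite the useless top block
ap = average_precision_score(y, scores)  # average precision is far less flattering
print(auc, ap)
```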