Naive Bayes is a simple probabilistic classifier that applies Bayes' theorem with strong (naive) independence assumptions. It is often effective in practice even when the assumptions are not strictly true. The document discusses the naive Bayes classifier and its assumptions, how to estimate parameters from data, and how to classify new examples by calculating conditional probabilities. It also covers important considerations like dealing with small probabilities, evaluating performance using metrics like sensitivity and specificity, and using cross-validation to estimate accuracy on new data.
- The Naive Bayes classifier is a probabilistic classifier that applies Bayes' theorem with strong (naive) independence assumptions.
- It estimates the probability of a class given attribute values by calculating the product of the individual attribute probabilities multiplied by the class prior probability.
- Training involves counting occurrences of attribute-value pairs in the training data to estimate probabilities, with smoothing to address zero counts. Classification then predicts the class with highest posterior probability for new examples.
2. Things We’d Like to Do
• Spam Classification
– Given an email, predict whether it is spam or not
• Medical Diagnosis
– Given a list of symptoms, predict whether a patient has disease X or not
• Weather
– Based on temperature, humidity, etc., predict if it will rain tomorrow
4. Another Application
• Digit Recognition
• X1,…,Xn ∈ {0,1} (black vs. white pixels)
• Y ∈ {5,6} (predict whether a digit is a 5 or a 6)
[Figure: an image of a handwritten digit fed into the classifier, which outputs “5”]
5. The Bayes Classifier
• A good strategy is to predict the class that is most probable given the observed features:
argmax_y P(Y = y | X1,…,Xn)
– (for example: what is the probability that the image represents a 5 given its pixels?)
• So … How do we compute that?
6. The Bayes Classifier
• Use Bayes Rule!

P(Y | X1,…,Xn) = P(X1,…,Xn | Y) P(Y) / P(X1,…,Xn)

where P(X1,…,Xn | Y) is the likelihood, P(Y) is the prior, and P(X1,…,Xn) is a normalization constant.
• Why did this help? Well, we think that we might be able to specify how features are “generated” by the class label
7. The Bayes Classifier
• Let’s expand this for our digit recognition task:
P(Y=5 | X1,…,Xn) = P(X1,…,Xn | Y=5) P(Y=5) / P(X1,…,Xn)
P(Y=6 | X1,…,Xn) = P(X1,…,Xn | Y=6) P(Y=6) / P(X1,…,Xn)
• To classify, we’ll simply compute these two probabilities and predict based on which one is greater
8. Model Parameters
• For the Bayes classifier, we need to “learn” two functions, the likelihood and the prior
• How many parameters are required to specify the prior for our digit recognition example?
– (Just one: P(Y=5), since P(Y=6) = 1 − P(Y=5))
9. Model Parameters
• How many parameters are required to specify the likelihood?
– (Supposing that each image is 30x30 pixels)
– (With 900 binary pixels: 2(2^900 − 1), per the general formula on slide 12)
10. Model Parameters
• The problem with explicitly modeling P(X1,…,Xn|Y) is that there are usually way too many parameters:
– We’ll run out of space
– We’ll run out of time
– And we’ll need tons of training data (which is usually not available)
11. The Naïve Bayes Model
• The Naïve Bayes Assumption: Assume that all features are independent given the class label Y
• Equationally speaking:
P(X1,…,Xn | Y) = Πi P(Xi | Y)
• (We will discuss the validity of this assumption later)
12. Why is this useful?
• # of parameters for modeling P(X1,…,Xn|Y): 2(2^n − 1)
• # of parameters for modeling P(X1|Y),…,P(Xn|Y): 2n
13. Naïve Bayes Training
• Now that we’ve decided to use a Naïve Bayes classifier, we need to train it with some data:
[Figure: sample images from the MNIST training data]
14. Naïve Bayes Training
• Training in Naïve Bayes is easy:
– Estimate P(Y=v) as the fraction of records with Y=v
– Estimate P(Xi=u|Y=v) as the fraction of records with Y=v for which Xi=u
• (This corresponds to Maximum Likelihood estimation of model parameters)
15. Naïve Bayes Training
• In practice, some of these counts can be zero
• Fix this by adding “virtual” counts:
– (This is like putting a prior on parameters and doing MAP estimation instead of MLE)
– This is called Smoothing
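For concreteness, here is a minimal R sketch of a smoothed estimate for one attribute (R to match the hands-on example at the end of the deck; the function and variable names are illustrative, and k = 1 gives Laplace smoothing):

# Smoothed estimate of P(Xi = u | Y = v): add k "virtual" counts per value.
# x: the attribute values among records with Y = v; domain: all possible values.
smoothed_prob <- function(x, u, domain, k = 1) {
  (sum(x == u) + k) / (length(x) + k * length(domain))
}

# A value never seen with this class still gets a non-zero probability:
smoothed_prob(c("no", "no", "no"), "yes", domain = c("yes", "no"))  # (0+1)/(3+2) = 0.2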
16. Naïve Bayes Training
• For binary digits, training amounts to averaging all of the training fives together and all of the training sixes together.
18. Another Example of the Naïve Bayes Classifier
The weather data, with counts and probabilities:

outlook            temperature          humidity            windy              play
          yes  no           yes  no            yes  no             yes  no    yes   no
sunny      2    3   hot      2    2   high      3    4    false     6    2     9     5
overcast   4    0   mild     4    2   normal    6    1    true      3    3
rainy      3    2   cool     3    1

sunny     2/9  3/5  hot     2/9  2/5  high     3/9  4/5   false    6/9  2/5  9/14  5/14
overcast  4/9  0/5  mild    4/9  2/5  normal   6/9  1/5   true     3/9  3/5
rainy     3/9  2/5  cool    3/9  1/5

A new day:
outlook = sunny, temperature = cool, humidity = high, windy = true, play = ?
19. • Likelihood of yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 ≈ 0.0053
• Likelihood of no = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 ≈ 0.0206
• Therefore, the prediction is No
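The same arithmetic can be checked in a couple of lines of R (a sketch, not part of the original slides):

# Unnormalized scores for the new day (sunny, cool, high humidity, windy)
p_yes <- (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
p_no  <- (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206

# Normalizing gives the posterior probabilities
c(yes = p_yes, no = p_no) / (p_yes + p_no)        # ≈ 0.205 and 0.795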
20. The Naive Bayes Classifier for Data Sets with Numerical Attribute Values
• One common practice to handle numerical attribute values is to assume normal distributions for numerical attributes.
21. The numeric weather data with summary statistics

outlook              temperature             humidity                windy             play
           yes  no             yes    no             yes    no             yes  no   yes   no
sunny       2    3             83     85             86     85    false     6    2    9     5
overcast    4    0             70     80             96     90    true      3    3
rainy       3    2             68     65             80     70
                               64     72             65     95
                               69     71             70     91
                               75                    80
                               75                    70
                               72                    90
                               81                    75

sunny      2/9  3/5  mean      73    74.6  mean     79.1   86.2   false   6/9  2/5  9/14  5/14
overcast   4/9  0/5  std dev  6.2    7.9   std dev  10.2   9.7    true    3/9  3/5
rainy      3/9  2/5
22. • Let x1, x2, …, xn be the values of a numerical attribute in the training data set.

μ = (1/n) Σ_{i=1..n} xi

σ² = (1/(n−1)) Σ_{i=1..n} (xi − μ)²

f(w) = (1/(√(2π)·σ)) · e^(−(w−μ)²/(2σ²))
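In R these normal-density calculations are just dnorm; a small sketch using the temperature column from slide 21 (the new value 66 is an arbitrary example, not from the slides):

# Temperature values from slide 21, split by class
temp_yes <- c(83, 70, 68, 64, 69, 75, 75, 72, 81)   # mean 73,   sd ≈ 6.2
temp_no  <- c(85, 80, 65, 72, 71)                   # mean 74.6, sd ≈ 7.9

# f(w) for a new day with temperature 66
dnorm(66, mean = mean(temp_yes), sd = sd(temp_yes))  # ≈ 0.034
dnorm(66, mean = mean(temp_no),  sd = sd(temp_no))   # ≈ 0.028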
24. Outputting Probabilities
• What’s nice about Naïve Bayes (and generative models in general) is that it returns probabilities
– These probabilities can tell us how confident the algorithm is
– So… don’t throw away those probabilities!
25. Performance on a Test Set
• Naïve Bayes is often a good choice if you don’t have much training data!
26. Naïve Bayes Assumption
• Recall the Naïve Bayes assumption:
– that all features are independent given the class label Y
• Does this hold for the digit recognition problem?
27. Exclusive-OR Example
• For an example where conditional independence fails:
– Y = XOR(X1, X2)

X1  X2  P(Y=0|X1,X2)  P(Y=1|X1,X2)
 0   0       1             0
 0   1       0             1
 1   0       0             1
 1   1       1             0
28. • Actually, the Naïve Bayes assumption is almost never true
• Still… Naïve Bayes often performs surprisingly well even when its assumptions do not hold
29. Numerical Stability
• It is often the case that machine learning algorithms need to work with very small numbers
– Imagine computing the probability of 2000 independent coin flips
– MATLAB thinks that (.5)^2000 = 0
30. Underflow Prevention
• Multiplying lots of probabilities → floating-point underflow.
• Recall: log(xy) = log(x) + log(y), so it is better to sum logs of probabilities rather than multiplying probabilities.
31. Underflow Prevention
• Class with highest final un-normalized log probability score is still the most probable:

cNB = argmax_{cj ∈ C} [ log P(cj) + Σ_{i ∈ positions} log P(xi | cj) ]
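A minimal R sketch of this log-space score (the helper name is illustrative):

# Log-space score for one class: log prior plus summed log likelihoods
log_score <- function(log_prior, log_likelihoods) {
  log_prior + sum(log_likelihoods)
}

# The slide-19 "yes" score, computed without multiplying tiny numbers
log_score(log(9/14), log(c(2/9, 3/9, 3/9, 3/9)))   # ≈ -5.24, i.e. log(0.0053)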
33. Recovering the Probabilities
• What if we want the probabilities though??
• Suppose that for some constant K, all we have are the shifted log scores from the previous slide:
– L1 = log P(Y = 1 | X1,…,Xn) + K
– And L2 = log P(Y = 2 | X1,…,Xn) + K
• How would we recover the original probabilities?
34. Recovering the Probabilities
• Given: the un-normalized log scores Li = log(pi) + K
• Then for any constant C:
pi = e^(Li + C) / Σj e^(Lj + C)
• One suggestion: set C such that the greatest Li is shifted to zero: C = −maxi Li
See https://stats.stackexchange.com/questions/105602/example-of-how-the-log-sum-exp-trick-works-in-naive-bayes?noredirect=1&lq=1
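A sketch of that suggestion in R (a hypothetical helper, not from the slides):

# Recover normalized probabilities from un-normalized log scores by
# shifting the greatest score to zero before exponentiating.
probs_from_log_scores <- function(log_scores) {
  shifted <- log_scores - max(log_scores)
  exp(shifted) / sum(exp(shifted))
}

# Scores this small would underflow to 0 if exponentiated directly:
probs_from_log_scores(c(yes = -1000, no = -1001))   # ≈ 0.731 and 0.269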
35. Recap
• We defined a Bayes classifier but saw that it’s intractable to compute P(X1,…,Xn|Y)
• We then used the Naïve Bayes assumption – that everything is independent given the class label Y
• A natural question: is there some happy compromise where we only assume that some features are conditionally independent?
– Stay Tuned…
36. Conclusions
• Naïve Bayes is:
– Really easy to implement and often works well
– Often a good first thing to try
– Commonly used as a “punching bag” for smarter algorithms
39. Types of errors
• But suppose that
– The 95% is the correctly classified pixels
– Only 5% of the pixels are actually edges
– It misses all the edge pixels
• How do we count the effect of different types of error?
40. Types of errors

                        Prediction
                    Edge             Not edge
Ground   Edge       True Positive    False Negative
Truth    Not edge   False Positive   True Negative
41. Two parts to each label: whether you got it correct or not, and what you guessed. For example, for a particular pixel, our guess might be labelled…

True Positive
– Did we get it correct? True, we did get it correct.
– What did we say? We said ‘positive’, i.e. edge.

or maybe it was labelled as one of the others, maybe…

False Negative
– Did we get it correct? False, we did not get it correct.
– What did we say? We said ‘negative’, i.e. not edge.
42. Sensitivity and Specificity
Count up the total number of each label (TP, FP, TN, FN) over a large dataset. In ROC analysis, we use two statistics:

Sensitivity = TP / (TP + FN)
– Can be thought of as the likelihood of spotting a positive case when presented with one. Or… the proportion of edges we find.

Specificity = TN / (TN + FP)
– Can be thought of as the likelihood of spotting a negative case when presented with one. Or… the proportion of non-edges that we find.
43. A worked example:

                      Prediction
                      1     0
Ground Truth   1     60    30    (60+30 = 90 cases in the dataset were class 1, edge)
               0     20    80    (80+20 = 100 cases in the dataset were class 0, non-edge)

90+100 = 190 examples (pixels) in the data overall

Sensitivity = TP / (TP + FN) = 60/90 ≈ 0.67
Specificity = TN / (TN + FP) = 80/100 = 0.80
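The same computation in R, as a sketch (the matrix layout and names are illustrative):

# Confusion matrix from slide 43: rows = ground truth, columns = prediction
conf <- matrix(c(60, 30,
                 20, 80),
               nrow = 2, byrow = TRUE,
               dimnames = list(truth = c("edge", "nonedge"),
                               pred  = c("edge", "nonedge")))

sensitivity <- conf["edge", "edge"] / sum(conf["edge", ])            # TP/(TP+FN) = 60/90
specificity <- conf["nonedge", "nonedge"] / sum(conf["nonedge", ])   # TN/(TN+FP) = 80/100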
44. The ROC space
[Figure: ROC space, plotting sensitivity against 1 − specificity on axes from 0.0 to 1.0; edge detectors A and B appear as two points in the space]
45. The ROC Curve
Draw a ‘convex hull’ around many points:
[Figure: many detector points plotted in ROC space (sensitivity vs. 1 − specificity) with their convex hull; one highlighted point does not lie on the convex hull]
46. ROC Analysis
[Figure: ROC space (sensitivity vs. 1 − specificity) with a convex hull of detectors, annotated as follows]
• All the optimal detectors lie on the convex hull.
• Which of these is best depends on the ratio of edges to non-edges, and the different cost of misclassification.
• Any detector on the wrong side of the chance diagonal can lead to a better detector by flipping its output.
Take-home point: You should always quote sensitivity and specificity for your algorithm, if possible plotting an ROC graph. Remember also though, any statistic you quote should be an average over a suitable range of tests for your algorithm.
47. Holdout estimation
• What to do if the amount of data is limited?
• The holdout method reserves a certain amount for testing and uses the remainder for training
– Usually: one third for testing, the rest for training
48. Holdout estimation
• Problem: the samples might not be representative
– Example: class might be missing in the test data
• Advanced version uses stratification
– Ensures that each class is represented with approximately equal proportions in both subsets
49. Repeated holdout method
• Repeat process with different subsamples → more reliable
– In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
– The error rates on the different iterations are averaged to yield an overall error rate
50. Repeated holdout method
• Still not optimum: the different test sets overlap
– Can we prevent overlapping?
– Of course!
51. Cross-validation
• Cross-validation avoids overlapping test sets
– First step: split data into k subsets of equal size
– Second step: use each subset in turn for testing, the remainder for training
• Called k-fold cross-validation
52. Cross-validation
• Often the subsets are stratified before the cross-validation is performed
• The error estimates are averaged to yield an overall error estimate
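A minimal R sketch of k-fold cross-validation (no stratification, for brevity; the helper names are illustrative, and the example uses the e1071 naiveBayes model that appears in the hands-on slides below):

# k-fold cross-validation skeleton: assign each row to one of k folds,
# then use each fold in turn as the test set and average the error rates.
kfold_error <- function(data, k, fit_and_score) {
  folds <- sample(rep(1:k, length.out = nrow(data)))
  errs <- sapply(1:k, function(f) {
    fit_and_score(data[folds != f, ], data[folds == f, ])
  })
  mean(errs)
}

# Example: 10-fold cross-validation of naiveBayes on iris
library(e1071)
kfold_error(iris, 10, function(train, test) {
  m <- naiveBayes(train[, 1:4], train[, 5])
  mean(predict(m, test[, 1:4]) != test[, 5])
})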
53. More on cross-validation
• Standard method for evaluation: stratified ten-fold cross-validation
• Why ten?
– Empirical evidence supports this as a good choice to get an accurate estimate
– There is also some theoretical evidence for this
• Stratification reduces the estimate’s variance
• Even better: repeated stratified cross-validation
– E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)
54. Leave-One-Out cross-validation
• Leave-One-Out: a particular form of cross-validation:
– Set number of folds to number of training instances
– I.e., for n training instances, build classifier n times
• Makes best use of the data
• Involves no random subsampling
• Very computationally expensive
– (exception: NN)
55. Leave-One-Out-CV and stratification
• Disadvantage of Leave-One-Out-CV: stratification is not possible
– It guarantees a non-stratified sample because there is only one instance in the test set!
56. Hands-on Example
# Import Bayes.csv from class webpage
Bayes <- read.csv("Bayes.csv")
# Select training data
traindata <- Bayes[1:14,]
# Select test data
testdata <- Bayes[15,]
57. Construct Naïve Bayes Classifier the hard way
# Calculate the Prior for Play
Pplay <- table(traindata$Play)
Pplay <- Pplay/sum(Pplay)
# Calculate P(Sunny | Play)
sunny <- table(traindata[,c("Play", "Sunny")])
sunny <- sunny/rowSums(sunny)
58. # Calculate P(Hot | Play)
hot <- table(traindata[,c("Play", "Hot")])
hot <- hot/rowSums(hot)
# and Calculate P(Windy | Play)
windy <- table(traindata[,c("Play", "Windy")])
windy <- windy/rowSums(windy)
59. # Evaluate testdata: multiply the prior in along with the conditionals
Pyes <- Pplay["Yes"] * sunny["Yes","Yes"] * hot["Yes","No"] * windy["Yes","Yes"]
Pno <- Pplay["No"] * sunny["No","Yes"] * hot["No","No"] * windy["No","Yes"]
# Do we play or not? (base R has max(), not Max(); a comparison names the winner)
ifelse(Pyes > Pno, "Yes", "No")
60. # Do it again, but use the naiveBayes function from the e1071 package
# install the package if you don’t already have it
install.packages("e1071")
# load package
library(e1071)
# train model
m <- naiveBayes(traindata[,1:3], traindata[,4])
# evaluate testdata
predict(m,testdata[,1:3])
61. # use the naïveBayes classifier on the iris data
m <- naiveBayes(iris[,1:4], iris[,5])
table(predict(m, iris[,1:4]), iris[,5])
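Note that this last example evaluates the classifier on the same iris rows it was trained on, so the resulting table overstates accuracy on new data; the holdout and cross-validation methods from slides 47–55 give a more honest estimate.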