This document provides an overview of nearest neighbor and Bayesian classifiers for machine learning. It discusses how nearest neighbor classifiers work by finding the k closest training examples in attribute space and classifying a new example based on the classes of those neighbors. It also explains naive Bayesian classification, which uses Bayes' theorem and assumes attribute independence to calculate class probabilities. Examples are given of how to apply these techniques to classification problems.
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
2. References
Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann Series in Data Management Systems, Second Edition.
Tom M. Mitchell, "Machine Learning", McGraw Hill, 1997.
Pang-Ning Tan, Michael Steinbach, Vipin Kumar, "Introduction to Data Mining", Addison Wesley.
Ian H. Witten, Eibe Frank, "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations", Second Edition.
Prof. Pier Luca Lanzi
3. Apples again!
Is this an apple?
4. A Simple Way to Classify a New Fruit
To decide whether an unseen fruit is an apple, consider the k fruits that are most similar to the unknown fruit. Then, classify the unknown fruit using the most frequent class among them.
5. A Simple Way to Classify a New Fruit
If k=5, these five fruits are the ones most similar to the unclassified example. Since the majority of them are apples, we decide that the unknown fruit is an apple.
6. Instance-Based Classifiers
Store the training records, then use the stored records to predict the class label of unseen cases.
(The original slide shows a table of training records with attributes Att1, …, Attn and a Class column, next to an unlabeled record with the same attributes.)
7. Instance-Based Methods
Instance-based methods are the simplest form of learning.
The training instances are searched for the instance that most closely resembles the new instance.
The instances themselves represent the knowledge; the similarity (distance) function defines what is "learned".
Evaluation is "lazy": it is deferred until a new instance must be classified.
Instance-based learning is lazy learning.
8. Instance-Based Classifiers
Rote-learner: memorizes the entire training data and performs classification only if the attributes of a record exactly match one of the training examples.
Nearest neighbor: uses the k "closest" points (nearest neighbors) to perform classification.
9. Rote Learning
Rote learning is basically memorization: saving knowledge so it can be used again. Retrieval is the only problem; no repeated computation, inference, or query is necessary.
A simple example of rote learning is caching: store computed values (or large pieces of data) and recall them when required by a computation. Significant time savings can be achieved, and many AI programs (as well as more general ones) have used caching very effectively.
10. Nearest Neighbor Classifiers
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
(The original slide shows a diagram: compute the distance between the test record and the training records, then choose the k "nearest" records.)
11. Nearest-Neighbor Classifiers
Requires three things:
the set of stored records;
a distance metric to compute the distance between records;
the value of k, the number of nearest neighbors to retrieve.
To classify an unknown record:
compute its distance to the training records;
identify the k nearest neighbors;
use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote).
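The three ingredients above fit in a few lines of Python. This is a minimal sketch (the function and variable names are illustrative, not from the slides):

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line distance between two numeric attribute vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, test_record, k):
    """training: list of (attributes, class_label) pairs."""
    # 1) Sort training records by distance to the test record.
    neighbors = sorted(training, key=lambda rec: euclidean(rec[0], test_record))
    # 2) Identify the k nearest neighbors.
    k_nearest = neighbors[:k]
    # 3) Take a majority vote over their class labels.
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

# Toy data: two numeric attributes, two classes.
training = [((1.0, 1.0), "apple"), ((1.2, 0.9), "apple"),
            ((5.0, 5.0), "pear"), ((5.1, 4.8), "pear"), ((1.1, 1.1), "apple")]
print(knn_classify(training, (1.0, 1.2), k=3))  # prints "apple"
```

Sorting the whole training set is O(n log n) per query; the point here is only to make the three-step recipe concrete.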
12. Definition of Nearest Neighbor
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
(The original slide illustrates (a) the 1-nearest neighbor, (b) the 2-nearest neighbors, and (c) the 3-nearest neighbors of a point X.)
14. What distance?
To compute the similarity between two points, use for instance the Euclidean distance.
Determine the class from the nearest-neighbor list:
take the majority vote of class labels among the k nearest neighbors, or
weigh each vote according to distance, e.g., with the weight factor w = 1/d².
Taking the square root is not required when comparing distances.
Another popular metric is the city-block (Manhattan) metric, where the distance is the sum of absolute differences.
15. Normalization and Other Issues
Different attributes are measured on different scales, so attributes need to be normalized, for instance by min-max scaling or by standardization:
a_i = (v_i − min v_i) / (max v_i − min v_i)        a_i = (v_i − Avg(v_i)) / StDev(v_i)
where v_i is the actual value of attribute i.
Nominal attributes: distance is either 0 or 1.
A common policy for missing values: assume they are maximally distant (given normalized attributes).
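The two normalization formulas above can be sketched as follows (helper names are mine; the z-score variant uses the sample standard deviation):

```python
def min_max(values):
    # a_i = (v_i - min v_i) / (max v_i - min v_i): rescales to [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    # a_i = (v_i - Avg(v_i)) / StDev(v_i): zero mean, unit standard deviation
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    return [(v - mean) / std for v in values]

heights = [1.5, 1.6, 1.8]               # metres
incomes = [10_000, 500_000, 1_000_000]  # dollars
# After min-max scaling both attributes live on the same [0, 1] scale:
print(min_max(heights))
print(min_max(incomes))
```

Without this step, an attribute like income would dominate any Euclidean distance computed together with height.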
16. Nearest Neighbor Classification…
Choosing the value of k:
if k is too small, the classifier is sensitive to noise points;
if k is too large, the neighborhood may include points from other classes.
17. Nearest Neighbor Classification: Scaling Issues
Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes. Example:
the height of a person may vary from 1.5m to 1.8m;
the weight of a person may vary from 90lb to 300lb;
the income of a person may vary from $10K to $1M.
18. Nearest Neighbor Classification
Problem with the Euclidean measure: with high-dimensional data (the curse of dimensionality) it can produce counter-intuitive results, e.g.
111111111110 vs 011111111111    d = 1.4142
100000000000 vs 000000000001    d = 1.4142
Both pairs are at the same distance, even though the vectors in the first pair share ten 1-bits while those in the second share none.
Solution: normalize the vectors to unit length.
19. Nearest Neighbor Classification
k-NN classifiers are lazy learners: they do not build models explicitly, unlike eager learners such as decision tree induction and rule-based systems. As a consequence, classifying unknown records is relatively expensive.
20. The Distance Function
Simplest case: one numeric attribute; the distance is the difference between the two attribute values involved (or a function thereof).
Several numeric attributes: normally, the Euclidean distance is used and the attributes are normalized.
Nominal attributes: the distance is set to 1 if the values are different, 0 if they are equal.
Are all attributes equally important? Weighting the attributes might be necessary.
21. Discussion of Nearest-Neighbor
Often very accurate, but slow, since the simple version scans the entire training data to derive a prediction.
Assumes all attributes are equally important, thus may need attribute selection or weights.
Possible remedies against noisy instances:
take a majority vote over the k nearest neighbors;
remove noisy instances from the dataset (difficult!).
Statisticians have used k-NN since the early 1950s: if n → ∞ and k/n → 0, the error approaches the minimum.
22. Case-Based Reasoning (CBR)
Also uses lazy evaluation and analyzes similar instances, but the instances are not "points in a Euclidean space".
Methodology:
instances are represented by rich symbolic descriptions (e.g., function graphs);
multiple retrieved cases may be combined;
tight coupling between case retrieval, knowledge-based reasoning, and problem solving.
Research issues: indexing based on a syntactic similarity measure and, on failure, backtracking and adapting to additional cases.
23. Lazy Learning vs. Eager Learning
Instance-based learning is based on lazy evaluation; decision-tree and Bayesian classification are based on eager evaluation.
Key difference: a lazy method may consider the query instance xq when deciding how to generalize beyond the training data D; an eager method cannot, because by the time it sees the query it has already committed to a global approximation.
24. Lazy Learning vs. Eager Learning
Efficiency: lazy learning needs less time for training but more time for predicting.
Accuracy: lazy methods effectively use a richer hypothesis space, since they use many local linear functions to form an implicit global approximation to the target function; eager methods must commit to a single hypothesis that covers the entire instance space.
25. Example: PEBLS
PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg).
Works with both continuous and nominal features.
For nominal features, the distance between two nominal values is computed using the modified value difference metric (MVDM).
Each record is assigned a weight factor.
The number of nearest neighbors is k = 1.
26. Example: PEBLS
Training data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Class counts:

Marital Status                          Refund
Class  Single  Married  Divorced        Class  Yes  No
Yes    2       0        1               Yes    0    3
No     2       4        1               No     3    4

The distance between two nominal attribute values V1 and V2 is
d(V1, V2) = Σ_i | n1i/n1 − n2i/n2 |
where nji is the number of examples with value Vj and class i, and nj is the total number of examples with value Vj. For instance:
d(Single, Married) = | 2/4 − 0/4 | + | 2/4 − 4/4 | = 1
d(Single, Divorced) = | 2/4 − 1/2 | + | 2/4 − 1/2 | = 0
d(Married, Divorced) = | 0/4 − 1/2 | + | 4/4 − 1/2 | = 1
d(Refund=Yes, Refund=No) = | 0/3 − 3/7 | + | 3/3 − 4/7 | = 6/7
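The MVDM distances can be reproduced directly from the class counts; a short sketch in which the dictionaries simply transcribe the count tables:

```python
# value -> [count in class Yes, count in class No], from the tables above
marital = {"Single": [2, 2], "Married": [0, 4], "Divorced": [1, 1]}
refund = {"Yes": [0, 3], "No": [3, 4]}

def mvdm(counts, v1, v2):
    # d(V1, V2) = sum_i | n1i/n1 - n2i/n2 |
    n1, n2 = sum(counts[v1]), sum(counts[v2])
    return sum(abs(c1 / n1 - c2 / n2) for c1, c2 in zip(counts[v1], counts[v2]))

print(mvdm(marital, "Single", "Married"))    # 1.0
print(mvdm(marital, "Single", "Divorced"))   # 0.0
print(mvdm(refund, "Yes", "No"))             # 6/7 ~ 0.857
```

Note how Single and Divorced come out at distance 0: they have identical class distributions, so for classification purposes they are interchangeable values.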
27. Example: PEBLS
Tid  Refund  Marital Status  Taxable Income  Cheat
X    Yes     Single          125K            No
Y    No      Married         100K            No

The distance between record X and record Y is
Δ(X, Y) = wX wY Σ_{i=1..d} d(Xi, Yi)²
where
wX = (number of times X is used for prediction) / (number of times X predicts correctly)
so that wX ≅ 1 if X makes accurate predictions most of the time, and wX > 1 if X is not reliable for making predictions.
28. Bayes Classifier
A probabilistic framework for solving classification problems.
Conditional probability: P(C|A) = P(A, C) / P(A)
Bayes theorem: P(C|A) = P(A|C) P(C) / P(A)
A priori probability of C, P(C): probability of the event before evidence is seen.
A posteriori probability of C, P(C|A): probability of the event after evidence is seen.
29. Naïve Bayes for Classification
What is the probability of the class given an instance?
Evidence E = the instance, represented as a tuple of attribute values <e1, …, en>.
Event H = the class value for the instance.
We are looking for the class value with the highest probability for E, i.e., for the hypothesis that has the highest probability of explaining the evidence E.
30. Naïve Bayes for Classification
Given the hypothesis H and the example E described by n attributes, Bayes theorem says that
P(H|E) = P(e1, …, en | H) P(H) / P(E)
Naïve assumption: the attributes are statistically independent, so the evidence splits into parts that are independent:
P(H|E) = P(e1|H) × … × P(en|H) × P(H) / P(E)
31. Naïve Bayes for Classification
Training: estimate the class probability P(H) and the conditional probability P(ei|H) for each attribute value ei and each class value H.
Testing: given an instance E, the class is computed as
class(E) = argmax_H  P(H) × Π_i P(ei|H)
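The training and testing steps above fit in a few lines. A minimal sketch without smoothing (the function names and toy data are mine, not from the slides):

```python
from collections import Counter, defaultdict

def train(records):
    """records: list of (attribute_tuple, class_label) pairs."""
    n = len(records)
    class_counts = Counter(label for _, label in records)
    # cond[(i, value, label)]: how often attribute i takes `value` in class `label`
    cond = defaultdict(int)
    for attrs, label in records:
        for i, value in enumerate(attrs):
            cond[(i, value, label)] += 1
    priors = {label: count / n for label, count in class_counts.items()}
    return priors, cond, class_counts

def classify(model, instance):
    priors, cond, class_counts = model
    scores = {}
    for label, prior in priors.items():
        score = prior  # P(H)
        for i, value in enumerate(instance):
            score *= cond[(i, value, label)] / class_counts[label]  # P(e_i | H)
        scores[label] = score
    # The class value with the highest probability for E.
    return max(scores, key=scores.get)

weather = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
           (("rain", "mild"), "yes"), (("overcast", "hot"), "yes"),
           (("rain", "cool"), "yes")]
model = train(weather)
print(classify(model, ("sunny", "hot")))  # prints "no"
```

With a zero count, this sketch zeroes out the whole class score, which is exactly the zero-frequency problem discussed a few slides later.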
32. Bayesian (Statistical) Modeling
The "opposite" of OneRule: use all the attributes.
Two assumptions:
the attributes are equally important;
the attributes are statistically independent, i.e., knowing the value of one attribute says nothing about the value of another (if the class is known).
The independence assumption is almost never correct, but this scheme works well in practice.
33. Using the Probabilities for Classification
Counts and relative frequencies from the weather data:

Outlook    Yes  No    Temperature  Yes  No    Humidity  Yes  No    Windy  Yes  No    Play  Yes   No
Sunny      2    3     Hot          2    2     High      3    4     False  6    2           9     5
Overcast   4    0     Mild         4    2     Normal    6    1     True   3    3
Rainy      3    2     Cool         3    1

Sunny      2/9  3/5   Hot          2/9  2/5   High      3/9  4/5   False  6/9  2/5         9/14  5/14
Overcast   4/9  0/5   Mild         4/9  2/5   Normal    6/9  1/5   True   3/9  3/5
Rainy      3/9  2/5   Cool         3/9  1/5

A new instance:
Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?
What is the value of "play"?
34. Using the Probabilities for Classification
Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?
Likelihood of "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 ≈ 0.0053
Likelihood of "no" = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 ≈ 0.0206
Conversion into a probability by normalization:
P("yes") = 0.0053 / (0.0053 + 0.0206) ≈ 20.5%
P("no") = 0.0206 / (0.0053 + 0.0206) ≈ 79.5%
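Multiplying the relative frequencies from the weather table for the instance (Sunny, Cool, High, True) and normalizing can be checked in a couple of lines:

```python
# Fractions read off the weather counts (9 "yes" days, 5 "no" days):
like_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)  # Sunny, Cool, High, True | yes
like_no = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # Sunny, Cool, High, True | no
print(round(like_yes, 4), round(like_no, 4))       # 0.0053 0.0206
# Normalization turns the two likelihoods into probabilities:
p_yes = like_yes / (like_yes + like_no)
print(f"P(yes) = {p_yes:.1%}, P(no) = {1 - p_yes:.1%}")  # P(yes) = 20.5%, P(no) = 79.5%
```

Note that P(E) never has to be computed: it cancels out in the normalization.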
35. The "Zero-Frequency Problem"
What if an attribute value does not occur with every class value? For instance, "Outlook = overcast" never occurs with class "no": its conditional probability will be zero, and the a posteriori probability will also be zero (no matter how likely the other values are!).
Remedy: add 1 to the count for every attribute value-class combination (the Laplace estimator).
Result: probabilities will never be zero (this also stabilizes probability estimates).
36. M-Estimate of Conditional Probability
Let n be the number of examples with class H, and nc the number of examples with class H and attribute value e. The m-estimate for P(e|H) is computed as
P(e|H) = (nc + m·p) / (n + m)
where m is a parameter known as the equivalent sample size and p is a user-specified parameter. If n is zero, then P(e|H) = p; thus p is the prior probability of observing e. A common choice is p = 1/k, given k values of the attribute considered.
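Both the Laplace estimator from the previous slide and the m-estimate fit one helper each; a sketch (function names are mine):

```python
def m_estimate(nc, n, p, m):
    # P(e|H) = (nc + m * p) / (n + m)
    return (nc + m * p) / (n + m)

def laplace(nc, n, k):
    # Laplace estimator: the m-estimate with p = 1/k and m = k,
    # i.e. add 1 to the count of each of the k attribute values.
    return (nc + 1) / (n + k)

# A zero count no longer gives a zero probability:
print(m_estimate(0, 5, 1/3, 3))   # 0.125
print(laplace(0, 5, 3))           # 0.125
# With no examples at all (n = 0), the estimate falls back to the prior p:
print(m_estimate(0, 0, 0.25, 2))  # 0.25
```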
37. Missing Values
Training: the instance is not included in the frequency counts for the attribute value-class combination.
Testing: the attribute is omitted from the calculation.
Example:
Outlook  Temp.  Humidity  Windy  Play
?        Cool   High      True   ?
Likelihood of "yes" = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
Likelihood of "no" = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
P("yes") = 0.0238 / (0.0238 + 0.0343) = 41%
P("no") = 0.0343 / (0.0238 + 0.0343) = 59%
38. Numeric Attributes
Usual assumption: attributes have a normal or Gaussian probability distribution (given the class).
Carl Friedrich Gauss, 1777-1855, great German mathematician.
39. Numeric Attributes
Usual assumption: attributes have a normal or Gaussian probability distribution (given the class).
The probability density function for the normal distribution is defined by two parameters, the sample mean μ and the standard deviation σ. The density function f(x) is then
f(x) = 1 / (√(2π) σ) · exp( −(x − μ)² / (2σ²) )
40. Statistics for Weather Data with Numeric Temperature
Outlook, Windy, and Play counts are as before (Sunny 2/3, Overcast 4/0, Rainy 3/2; False 6/2, True 3/3; Play 9/14, 5/14), while temperature and humidity are now numeric:
Temperature (yes): 64, 68, 69, 70, 72, …   μ = 73, σ = 6.2
Temperature (no):  65, 71, 72, 80, 85, …   μ = 75, σ = 7.9
Humidity (yes):    65, 70, 70, 75, 80, …   μ = 79, σ = 10.2
Humidity (no):     70, 85, 90, 91, 95, …   μ = 86, σ = 9.7
Example density value:
f(temperature = 66 | yes) = 1/(√(2π)·6.2) · exp(−(66 − 73)²/(2·6.2²)) ≈ 0.0340
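The density function and the example value can be checked directly (small differences in the last digit come from the rounded μ and σ):

```python
import math

def gaussian(x, mu, sigma):
    # f(x) = 1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Density of temperature = 66 on a "yes" day (mu = 73, sigma = 6.2):
print(round(gaussian(66, 73, 6.2), 4))  # 0.034
```

Unlike a probability, a density value can exceed 1; it only enters the naïve Bayes product in place of the relative frequency of a nominal value.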
41. Classifying a New Day
A new day:
Outlook  Temp.  Humidity  Windy  Play
Sunny    66     90        true   ?
Missing values during training are not included in the calculation of the mean and standard deviation.
Likelihood of "yes" = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
Likelihood of "no" = 3/5 × 0.0291 × 0.0380 × 3/5 × 5/14 = 0.000136
P("yes") = 0.000036 / (0.000036 + 0.000136) = 20.9%
P("no") = 0.000136 / (0.000036 + 0.000136) = 79.1%
42. Naïve Bayes: Discussion
Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated). Why? Because classification doesn't require accurate probability estimates, as long as the maximum probability is assigned to the correct class.
However: adding too many redundant attributes will cause problems (e.g., identical attributes). Also, many numeric attributes are not normally distributed.
43. Bayesian Belief Networks
The conditional independence assumption makes computation possible and yields optimal classifiers when satisfied, but it is seldom satisfied in practice, as attributes (variables) are often correlated.
Bayesian Belief Networks (BBN) allow us to specify which pairs of attributes are conditionally independent. They provide a graphical representation of the probabilistic relationships among a set of random variables.
44. Bayesian Belief Networks
Describe the probability distribution governing a set of variables by specifying conditional independence assumptions that apply to subsets of the variables, together with a set of conditional probabilities.
Two key elements:
a directed acyclic graph, encoding the dependence relationships among the variables;
a probability table associating each node with its immediate parent nodes.
45. Examples of Bayesian Belief Networks
a) Variables A and B are independent, but each of them has an influence on the variable C.
b) A is conditionally independent of both B and D, given C.
c) The configuration of the typical naïve Bayes classifier.
46. Probability Tables
The network topology imposes conditions regarding the variables' conditional independence. Each node is associated with a probability table:
if a node X does not have any parents, the table contains only the prior probability P(X);
if a node X has only one parent Y, the table contains only the conditional probability P(X|Y);
if a node X has multiple parents Y1, …, Yk, the table contains the conditional probability P(X|Y1, …, Yk).
49. Model Building
Let T = {X1, X2, …, Xd} denote a total order of the variables.
for j = 1 to d do
    Let XT(j) denote the jth highest-order variable in T
    Let π(XT(j)) = {XT(1), …, XT(j−1)} denote the variables preceding XT(j)
    Remove from π(XT(j)) the variables that do not affect XT(j)
    Create an arc between XT(j) and the remaining variables in π(XT(j))
done
This guarantees a topology that does not contain cycles. Different orderings generate different networks; in principle, there are d! possible networks to test.
50. Bayesian Belief Networks
(The original slide shows a network in which some variables are evidence, i.e. observed, while others are unobserved and must be predicted.)
In general the inference is NP-complete, but there are approximating methods, e.g., Monte Carlo.
51. Bayesian Belief Networks
A Bayesian belief network allows a subset of the variables to be conditionally independent; it is a graphical model of causal relationships.
Several cases of learning Bayesian belief networks:
given both the network structure and all the variables, learning is easy;
given the network structure but only some of the variables;
when the network structure is not known in advance.
52. Summary of Bayesian Belief Networks
BBN provide an approach for capturing the prior knowledge of a particular domain using a graphical model.
Building the network is time consuming, but adding variables to a network is straightforward.
They can encode causal dependencies among variables, they are well suited to dealing with incomplete data, and they are robust to overfitting.