There are 3 boxes. B1 has 2 white, 3 black and 4 red balls. B2 has 3 white, 2 black and 2 red balls. B3 has 4 white, 1 black and 3 red balls. A box is chosen at random and 2 balls are drawn. One is white and the other is red. What is the probability that they came from the first box?
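The box problem above can be worked with Bayes' rule. The sketch below assumes the two balls are drawn without replacement and that each box is chosen with probability 1/3; the names `boxes`, `likelihood` and `posterior` are introduced here only for illustration.

```python
from fractions import Fraction
from math import comb

# Contents of each box as given in the problem: (white, black, red).
boxes = {"B1": (2, 3, 4), "B2": (3, 2, 2), "B3": (4, 1, 3)}

def likelihood(white, black, red):
    """P(one white and one red | box), drawing 2 balls without replacement (assumed)."""
    total = white + black + red
    return Fraction(white * red, comb(total, 2))

prior = Fraction(1, len(boxes))  # each box equally likely (assumed from "at random")
joint = {name: prior * likelihood(*counts) for name, counts in boxes.items()}
evidence = sum(joint.values())   # P(one white and one red)
posterior = {name: p / evidence for name, p in joint.items()}

print(posterior["B1"])  # 14/59
```

Under these assumptions the likelihoods are 2/9, 2/7 and 3/7 for B1, B2 and B3, so the posterior for the first box is 14/59, roughly 0.24.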
The probabilistic model of the naïve Bayes classifier (NBC) finds the probability of a certain class given multiple attributes that are assumed to be conditionally independent.
The naïve Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V. A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values <a1,a2,…,an>. The learner is asked to predict the target value, or classification, for this new instance.
In the naïve Bayes classifier we make the assumption of class conditional independence: given the class label of a sample, the values of the attributes are conditionally independent of one another.
However, there can be dependencies between the values of attributes. To handle them we use a Bayesian belief network, which provides a joint conditional probability distribution.
A Bayesian network is a form of probabilistic graphical model. Specifically, a Bayesian network is a directed acyclic graph of nodes representing variables and arcs representing dependence relations among the variables.
A Bayesian network is a representation of the joint distribution over all the variables represented by nodes in the graph. Let the variables be X(1), ..., X(n).
Let Parents(A) be the parents of the node A. Then the joint distribution for X(1) through X(n) is represented as the product of the probability distributions P(X(i) | Parents(X(i))) for i = 1 to n. If X(i) has no parents, its probability distribution is said to be unconditional; otherwise it is conditional.
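The factorization above can be sketched in a few lines. The three-node network (Rain and Sprinkler as parents of WetGrass) and every probability value in it are hypothetical, chosen only to show how the product of P(X(i) | Parents(X(i))) terms is evaluated.

```python
# Hypothetical network: Rain -> WetGrass <- Sprinkler.
# All probability values below are invented for illustration.
parents = {"Rain": (), "Sprinkler": (), "WetGrass": ("Rain", "Sprinkler")}

# cpt[node][parent_values] = P(node = True | parents = parent_values)
cpt = {
    "Rain": {(): 0.2},        # no parents: unconditional distribution
    "Sprinkler": {(): 0.1},   # no parents: unconditional distribution
    "WetGrass": {             # conditional on (Rain, Sprinkler)
        (True, True): 0.99,
        (True, False): 0.90,
        (False, True): 0.80,
        (False, False): 0.05,
    },
}

def joint_probability(assignment):
    """Product over all nodes of P(X_i | Parents(X_i)) for a full True/False assignment."""
    prob = 1.0
    for node, node_parents in parents.items():
        parent_values = tuple(assignment[p] for p in node_parents)
        p_true = cpt[node][parent_values]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob

# P(Rain, not Sprinkler, WetGrass) = 0.2 * 0.9 * 0.9, about 0.162
print(joint_probability({"Rain": True, "Sprinkler": False, "WetGrass": True}))
```

Summing `joint_probability` over all eight assignments gives 1, which is a quick check that the tables define a valid joint distribution.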
Our approach to representing arbitrary text documents is disturbingly simple: Given a text document, such as this paragraph, we define an attribute for each word position in the document and define the value of that attribute to be the English word found in that position. Thus, the current paragraph would be described by 111 attribute values, corresponding to the 111 word positions. The value of the first attribute is the word “our,” the value of the second attribute is the word “approach,” and so on. Notice that long text documents will require a larger number of attributes than short documents. As we shall see, this will not cause us any trouble.
We know P(like) = .3 and P(dislike) = .7 in the current example.
P(a_i = w_k | v_j) (here we introduce w_k to indicate the k-th word in the English vocabulary)
Estimating the class conditional probabilities (e.g., P(a_1 = “our” | dislike)) is more problematic because we must estimate one such probability term for each combination of text position, English word, and target value.
There are approximately 50,000 distinct words in the English vocabulary, 2 possible target values, and 111 text positions in the current example, so we must estimate 2 × 111 × 50,000 = 11.1 million (roughly 10 million) such terms from the training data.
We make an assumption that reduces the number of probabilities that must be estimated.
We shall assume the probability of encountering a specific word w_k (e.g., “chocolate”) is independent of the specific word position being considered (e.g., a_23 versus a_95).
We estimate the entire set of probabilities P(a_1 = w_k | v_j), P(a_2 = w_k | v_j), ... by the single position-independent probability P(w_k | v_j).
The net effect is that we now require only 2 × 50,000 distinct terms of the form P(w_k | v_j).
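The saving from the position-independence assumption is plain arithmetic. This short sketch only multiplies out the figures quoted above; the variable names are illustrative.

```python
vocabulary_size = 50_000  # approximate number of distinct English words
target_values = 2         # like, dislike
positions = 111           # word positions in the example paragraph

# One term per (target value, position, word) without the assumption.
position_dependent = target_values * positions * vocabulary_size
# One term per (target value, word) once position is ignored.
position_independent = target_values * vocabulary_size

print(position_dependent)    # 11100000
print(position_independent)  # 100000
```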
We adopt the m-estimate, with uniform priors and with m equal to the size of the word vocabulary:
P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|)
Here n is the total number of word positions in all training examples whose target value is v_j, n_k is the number of times word w_k is found among these n word positions, and |Vocabulary| is the total number of distinct words (and other tokens) found within the training data.
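A small numeric sketch of the smoothed estimate (n_k + 1) / (n + |Vocabulary|) may help. The counts used here (1,000 word positions, a 500-word vocabulary) are hypothetical, and `word_probability` is a name introduced only for this example.

```python
from fractions import Fraction

def word_probability(n_k, n, vocabulary_size):
    """Estimate P(w_k | v_j) as (n_k + 1) / (n + |Vocabulary|)."""
    return Fraction(n_k + 1, n + vocabulary_size)

# Hypothetical counts: 1,000 word positions in class v_j, 500 distinct words.
print(word_probability(n_k=0, n=1000, vocabulary_size=500))   # 1/1500
print(word_probability(n_k=29, n=1000, vocabulary_size=500))  # 1/50
```

A word never seen in class v_j still gets a small nonzero probability, so a single unseen word cannot force the whole product for that class to zero.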
LEARN_NAIVE_BAYES_TEXT(Examples, V)
Examples is a set of text documents along with their target values. V is the set of all possible target values. This function learns the probability terms P(w_k | v_j), describing the probability that a randomly drawn word from a document in class v_j will be the English word w_k. It also learns the class prior probabilities P(v_j).
1. Collect all words, punctuation, and other tokens that occur in Examples
• Vocabulary ← the set of all distinct words and tokens occurring in any text document from Examples
2. Calculate the required P(v_j) and P(w_k | v_j) probability terms
• For each target value v_j in V do
  • docs_j ← the subset of documents from Examples for which the target value is v_j
  • P(v_j) ← |docs_j| / |Examples|
  • Text_j ← a single document created by concatenating all members of docs_j
  • n ← total number of distinct word positions in Text_j
  • for each word w_k in Vocabulary
    • n_k ← number of times word w_k occurs in Text_j
    • P(w_k | v_j) ← (n_k + 1) / (n + |Vocabulary|)
CLASSIFY_NAIVE_BAYES_TEXT(Doc)
Return the estimated target value for the document Doc. a_i denotes the word found in the i-th position within Doc.
• positions ← all word positions in Doc that contain tokens found in Vocabulary
• Return v_NB, where
  v_NB = argmax_{v_j ∈ V} P(v_j) ∏_{i ∈ positions} P(a_i | v_j)
During learning, the procedure LEARN_NAIVE_BAYES_TEXT examines all training documents to extract the vocabulary of all words and tokens that appear in the text, then counts their frequencies among the different target classes to obtain the necessary probability estimates. Later, given a new document to be classified, the procedure CLASSIFY_NAIVE_BAYES_TEXT uses these probability estimates to calculate v_NB according to the naive Bayes classification rule. Note that any words appearing in the new document that were not observed in the training set are simply ignored by CLASSIFY_NAIVE_BAYES_TEXT.
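The two procedures can be sketched in Python as follows. This is a minimal illustration, not the original implementation: it assumes documents are already split into token lists and that every target value has at least one training document. It also sums log probabilities instead of multiplying probabilities, a common substitution that gives the same argmax while avoiding numeric underflow on long documents.

```python
import math
from collections import Counter

def learn_naive_bayes_text(examples, target_values):
    """examples: list of (token_list, target_value) pairs."""
    vocabulary = {token for tokens, _ in examples for token in tokens}
    priors, word_probs = {}, {}
    for v in target_values:
        docs = [tokens for tokens, label in examples if label == v]
        priors[v] = len(docs) / len(examples)
        text = [token for tokens in docs for token in tokens]  # concatenated docs
        counts = Counter(text)
        n = len(text)  # number of word positions in the concatenated text
        word_probs[v] = {
            w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary
        }
    return vocabulary, priors, word_probs

def classify_naive_bayes_text(doc, vocabulary, priors, word_probs):
    """Return the most probable target value; unknown tokens are ignored."""
    positions = [token for token in doc if token in vocabulary]

    def log_score(v):
        # log P(v) + sum of log P(a_i | v) over the known word positions
        return math.log(priors[v]) + sum(math.log(word_probs[v][t]) for t in positions)

    return max(priors, key=log_score)

# Tiny hypothetical training set, invented for illustration.
examples = [
    ("chocolate is great".split(), "like"),
    ("great chocolate cake".split(), "like"),
    ("boring slow article".split(), "dislike"),
]
model = learn_naive_bayes_text(examples, ["like", "dislike"])
print(classify_naive_bayes_text("great chocolate unknownword".split(), *model))  # like
print(classify_naive_bayes_text("boring article".split(), *model))               # dislike
```

The token "unknownword" never appears in training, so it is dropped from `positions` before scoring, matching the behaviour described above.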
The target classification for an article is the name of the Usenet newsgroup in which the article appeared.
In the experiment described by Joachims (1996), 20 electronic newsgroups were considered.
1,000 articles were collected from each newsgroup, forming a data set of 20,000 documents. The naive Bayes algorithm was then applied using two-thirds of these 20,000 documents as training examples, and performance was measured over the remaining third.
The 100 most frequent words were removed (these include words such as “the” and “of”), and any word occurring fewer than three times was also removed. The resulting vocabulary contained approximately 38,500 words.
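The vocabulary pruning described above can be sketched as follows. The function name `build_vocabulary` and its parameters are hypothetical, and the handling of ties among equally frequent words is an assumption, since the experiment's exact procedure is not given here.

```python
from collections import Counter

def build_vocabulary(documents, top_k=100, min_count=3):
    """documents: iterable of token lists. Returns the pruned vocabulary as a set."""
    counts = Counter(token for tokens in documents for token in tokens)
    # The top_k most frequent words (e.g. "the", "of") are dropped.
    most_frequent = {word for word, _ in counts.most_common(top_k)}
    # Words occurring fewer than min_count times are dropped as well.
    return {
        word for word, count in counts.items()
        if word not in most_frequent and count >= min_count
    }

# Toy usage with top_k=1 so the effect is visible on a tiny corpus.
docs = [["the"] * 10 + ["cat"] * 3 + ["rare"]]
print(build_vocabulary(docs, top_k=1, min_count=3))  # {'cat'}
```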
A newsgroup posting service that learns to assign documents to the appropriate newsgroup.
NEWSWEEDER system: a program for reading netnews that allows the user to rate articles as he or she reads them. NEWSWEEDER then uses these rated articles (i.e., its learned profile of user interests) to suggest the most highly rated new articles each day.
Naive Bayes Spam Filtering Using Word-Position-Based Attributes
Bayesian Learning Networks Approach to Cybercrime Detection. N S ABOUZAKHAR, A GANI and G MANSON, The Centre for Mobile Communications Research (C4MCR), University of Sheffield, Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK. M ABUITBEL and D KING, The Manchester School of Engineering, University of Manchester, IT Building, Room IT 109, Oxford Road, Manchester M13 9PL, UK.
Let’s look at the problem from the opposite direction. If we set the probability of a portsweep attack to 100%, then the values of some associated variables will inevitably change.
We note from Figure 4 that the probabilities of the TCP protocol and the private service have increased from 38.10% to 97.49% and from 24.71% to 71.45% respectively. We can also see an increase in the REJ and RSTR flags.