Classification Of Web Documents

Classification
Chapter 5, Data Mining the Web
By Hussain Ahmad, M.S. (Semantic Web), University of Peshawar, Pakistan
..Classification..
- In clustering we use the document class labels for evaluation purposes.
- In classification they are an essential part of the input to the learning system.
- The objective of the system is to create a mapping (also called a model or hypothesis) between a set of documents and a set of class labels.
- This mapping is then used to determine automatically the class of new (unlabeled) documents.
..Classification..
- This mapping process is called classification.
- The general framework for classification includes not only the model creation phase but also the steps that prepare the data, evaluate the model, and apply it.
- This framework is usually called supervised learning (also, learning from examples or concept learning) and includes the following steps:
..Classification..
- Step 1: Data collection and preprocessing.
  - Documents are collected, cleaned, and properly organized; the terms (features) are identified; and a vector space representation is created.
  - Documents are organized in classes (categories), based on their topic, user preference, or any other criterion.
  - The data are divided into two subsets (a sketch follows):
    - Training set: this part of the data will be used to create the model.
    - Test set: this part of the data is used for testing the model.
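The slides contain no code, but Step 1 is easy to illustrate. Below is a minimal Python sketch, assuming scikit-learn and a made-up four-document corpus (the documents and labels are invented for illustration; they are not the chapter's department collection):

```python
# Step 1 sketch: build TFIDF vectors and split the data into training and
# test subsets. The toy documents and labels are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

docs = [
    "anthropology of science and human evolution",
    "criminal justice and the history of law",
    "communication in modern media studies",
    "music theory and the philosophy of art",
]
labels = ["A", "B", "B", "A"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # documents as TFIDF row vectors

# Hold out part of the labeled data for testing the model later.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0
)
```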
..Classification..
- Step 2: Building the model.
  - This is the actual learning (also called training) step, which includes the use of the learning algorithm.
  - It is usually an iterative and interactive process that may include other steps and may be repeated several times so that the best model is created:
    - Feature selection
    - Applying the learning algorithm
    - Validating the model (using the validation subset to tune some parameters of the learning algorithm)
..Classification..
- Step 3: Testing and evaluating the model.
  - At this step the model is applied to the documents from the test set, and their actual class labels are compared to the labels predicted by the model.
- Step 4: Using the model to classify new documents (with unknown class labels).
Web Documents Exhibit Some Specific Properties
- Web documents exhibit some specific properties which may require adjustment or careful choice of the learning algorithms. The basic ones are:
  - Text and web documents include thousands of words.
  - The document features inherit some of the properties of the natural language text from which they are derived.
  - Documents are of different sizes.
Nearest-Neighbor Algorithm
- The nearest-neighbor algorithm is a straightforward application of similarity (or distance) for the purposes of classification.
- It predicts the class of a new document using the class label of the closest document from the training set.
- Because it uses just one instance from the training set, this basic version of the algorithm is called one-nearest-neighbor (1-NN).
..NN Algorithm..
- Closeness is measured by minimal distance or maximal similarity.
- The most common approach is to use the TFIDF (term frequency–inverse document frequency) framework to represent both the test and training documents, and to compute the cosine similarity between the document vectors.
..NN Algorithm..
- Let us consider the department document collection, represented as TFIDF vectors with six attributes, along with the class labels for each document, as shown in Table 5.1.
- Assume that the class of the Theatre document is unknown.
- To determine the class of this document, we compute the cosine similarity between the Theatre vector and all other vectors, as sketched below.
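Table 5.1 appears only as an image in the original deck, so the sketch below uses hypothetical six-attribute vectors; only the cosine-similarity and 1-NN mechanics follow the slides:

```python
# Rank training documents by cosine similarity to the unlabeled Theatre
# vector; 1-NN predicts the label of the top-ranked document.
import numpy as np

def cosine_sim(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Hypothetical TFIDF vectors (stand-ins for the Table 5.1 values).
training = {
    "Criminal Justice": (np.array([0.0, 0.0, 0.0, 0.0, 0.97, 0.0]), "B"),
    "Anthropology":     (np.array([0.6, 0.2, 0.0, 0.4, 0.10, 0.0]), "A"),
    "Communication":    (np.array([0.0, 0.3, 0.5, 0.0, 0.60, 0.0]), "B"),
}
theatre = np.array([0.0, 0.1, 0.2, 0.0, 0.90, 0.1])

ranked = sorted(training.items(),
                key=lambda kv: cosine_sim(theatre, kv[1][0]),
                reverse=True)
name, (_, label) = ranked[0]
print(name, label)   # 1-NN prediction: the label of the closest document
```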
..NN Algorithm..
- The 1-NN approach simply picks the most similar document, i.e., Criminal Justice, and uses its label B to predict the class of Theatre.
- However, if we look at the nearest neighbor of Theatre (Criminal Justice), we see only one nonzero attribute, which alone produces the prediction.
- This makes the algorithm extremely sensitive to noise and irrelevant attributes.
..NN Algorithm..
- Using 1-NN therefore makes two assumptions:
  - There is no noise, and
  - All attributes are equally important for the classification.
- k-NN is a generalization of 1-NN (a voting sketch follows):
  - The parameter k is selected to be a small odd number (usually 3 or 5).
  - For example, 3-NN classifies Theatre as class B, because B is the majority label among the top three documents (B, A, B).
  - 5-NN will predict class A, because the set of labels of the top five documents is {B, A, B, A, A}.
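On top of a similarity ranking like the one above, k-NN is a short majority vote; this sketch assumes `ranked` holds (name, (vector, label)) pairs sorted by decreasing similarity, as in the previous snippet:

```python
# k-NN: majority vote over the labels of the k most similar documents.
from collections import Counter

def knn_predict(ranked, k):
    top_labels = [label for _, (_, label) in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# With the slides' ranking, the top three labels are (B, A, B), so 3-NN
# predicts B; the top five are (B, A, B, A, A), so 5-NN predicts A.
```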
..NN Algorithm..
- Distance-weighted k-NN sums the similarities per class instead of counting votes (see the sketch below).
  - For example, distance-weighted 3-NN with the simplest weighting scheme [sim(X, Y)] will predict class B for the Theatre document:
    - The weight for label B (documents Criminal Justice and Communication) is B = 0.967075 + 0.605667 = 1.572742,
    - while the weight for label A (Anthropology) is A = 0.695979,
    - and thus B > A.
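The weighted variant replaces vote counting with a per-class sum of similarities; the similarity values below are the ones quoted in the slides:

```python
# Distance-weighted k-NN: sum the similarity of each of the top k neighbors
# into its class, then pick the class with the largest total weight.
def weighted_knn_predict(neighbors, k):
    weights = {}
    for sim, label in neighbors[:k]:
        weights[label] = weights.get(label, 0.0) + sim
    return max(weights, key=weights.get)

# Similarities from the slides, in decreasing order.
top3 = [(0.967075, "B"), (0.695979, "A"), (0.605667, "B")]
print(weighted_knn_predict(top3, 3))  # B: 0.967075 + 0.605667 = 1.572742 > 0.695979
```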
FEATURE SELECTION
- The objective of feature selection is to find a subset of attributes that best describe a set of documents with respect to the classification task, i.e., the attributes with which the learning algorithm achieves maximal accuracy.
- A simple solution is to try all subsets and pick the one that maximizes accuracy.
- This solution is impractical due to the huge number of subsets that have to be investigated (2^n for n attributes); a greedy alternative is sketched below.
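As a hedged illustration of a practical alternative, here is a sketch of greedy forward selection; `evaluate` is a placeholder assumed to train a classifier on a candidate attribute subset and return its accuracy:

```python
# Greedy forward selection: instead of testing all 2^n subsets, grow the
# subset one attribute at a time, keeping whichever addition improves
# accuracy the most, and stop when no addition helps.
def forward_selection(attributes, evaluate):
    selected, best_score = [], 0.0
    improved = True
    while improved:
        improved = False
        for a in set(attributes) - set(selected):
            score = evaluate(selected + [a])
            if score > best_score:
                best_score, best_attr, improved = score, a, True
        if improved:
            selected.append(best_attr)
    return selected
```

This examines on the order of n^2 subsets rather than 2^n, at the cost of possibly missing the optimal subset.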
Naive Bayes Algorithm
- Bayesian classification comes in two approaches:
  - One based on the Boolean document representation, and
  - Another based on document representation by term counts.
- Consider the set of Boolean document vectors shown in the table.
..Naive Bayes Algorithm..
- The task is to classify the Theatre document, given the rest of the documents with known class labels.
- The Bayesian approach determines the class of document x as the one that maximizes the conditional probability P(C | x). According to Bayes' rule,

  P(C | x) = P(x | C) P(C) / P(x)

  (The denominator P(x) is the same for all classes and can be ignored when comparing them.)
- Naive Bayes further assumes that the attribute values are independent given the class. Given that x is a vector of n attribute values [i.e., x = (x1, x2, . . . , xn)], this assumption leads to:

  P(C | x) ∝ P(C) · P(x1 | C) · P(x2 | C) · · · P(xn | C)
..Naive Bayes Algorithm..
- Now, to find the class of the Theatre document, we compute the conditional probabilities of class A and class B given that this document has occurred. For class A we have

  P(A | Theatre) ∝ P(A) · P(x1 = v1 | A) · · · P(x6 = v6 | A),

  where v1, . . . , v6 are the attribute values in the Theatre vector.
..Naive Bayes Algorithm..
- To calculate each of the probabilities above, we take the proportion of the corresponding attribute value in class A.
- For example, in the science column we have 0's in four documents out of the 11 in class A. Thus, P(science = 0 | A) = 4/11.
- The probabilities for class B are obtained in the same way, from the documents in class B.
..Naive Bayes Algorithm..
- The probabilities of classes A and B are estimated by the proportion of documents in each:
  - P(A) = 11/19 ≈ 0.578947 and
  - P(B) = 8/19 ≈ 0.421053.
- Putting all of this into the Bayes formula, the product for class A comes out larger, so at this point we can decide that Theatre belongs to class A (a sketch of the computation follows).
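The Boolean table itself is not reproduced in this transcript, so the sketch below shows only the mechanics of the computation; with the chapter's data it reproduces estimates such as P(A) = 11/19 and P(science = 0 | A) = 4/11:

```python
# Boolean naive Bayes: score(C) = P(C) * prod over i of P(x_i = v_i | C),
# with each conditional probability estimated as a proportion in class C.
# X is a list of 0/1 vectors, y the class labels, x_new the test vector.
def nb_boolean_predict(X, y, x_new):
    scores = {}
    for c in set(y):
        docs_c = [x for x, label in zip(X, y) if label == c]
        p = len(docs_c) / len(X)               # class prior, e.g. P(A) = 11/19
        for i, v in enumerate(x_new):
            matches = sum(1 for x in docs_c if x[i] == v)
            p *= matches / len(docs_c)         # e.g. P(science = 0 | A) = 4/11
        scores[c] = p
    return max(scores, key=scores.get)         # the class maximizing P(C | x)
```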
..Naive Bayes Algorithm..
- Although the Boolean naive Bayes algorithm uses all training documents, it ignores the term counts.
- A Bayesian model based on term counts can also classify our test document:
  - Assume that there are m terms t1, t2, . . . , tm
  - and n documents d1, d2, . . . , dn from class C.
  - Let us denote by nij the number of times that term ti occurs in document dj.
..Naive Bayes Algorithm..
- Denote by P(ti | C) the probability with which term ti occurs in documents from class C.
- This can be estimated as the number of times that ti occurs in all documents from class C over the total number of terms in the documents from class C:

  P(ti | C) = Σj nij / Σk Σj nkj
..Naive Bayes Algorithm..
- First we calculate the probabilities P(ti | C). Some of these estimates are zero, which makes the whole product zero.
- For example, this happens with the term history and class A; that is, P(history | A) = 0.
- Consequently, the documents that have a nonzero count for history will have zero probability in class A: P(History | A) = 0, P(Music | A) = 0, and P(Philosophy | A) = 0.
..Naive Bayes Algorithm..
- A common approach to avoid this problem is to use the Laplace estimator.
- The idea is to add 1 to the frequency count in the numerator and 2 (or the number of classes, if more than two) to the denominator.
- The Laplace estimator thus assigns a small nonzero probability to terms unseen in a class, avoiding the zero-probability situation.
..Naive Bayes Algorithm..
- Now we compute the probabilities of each term given each class using the Laplace estimator. For example,
  - P(history | A) = (0 + 1)/(57 + 2) ≈ 0.017 and
  - P(history | B) = (9 + 1)/(29 + 2) ≈ 0.323.
- Plugging all these probabilities into the formula results in
  - P(A | Theatre) ≈ 0.0000354208 and
  - P(B | Theatre) ≈ 0.00000476511,
  - so the count-based model also assigns Theatre to class A.
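The smoothed estimate itself is one line; the counts below (history occurs 0 times among the 57 class-A terms and 9 times among the 29 class-B terms) are the slides' numbers:

```python
# Laplace estimator: add 1 to the numerator count and 2 (the number of
# classes) to the denominator, so no term probability is exactly zero.
def laplace_estimate(term_count, total_terms, num_classes=2):
    return (term_count + 1) / (total_terms + num_classes)

print(laplace_estimate(0, 57))   # P(history | A) = 1/59  ≈ 0.017
print(laplace_estimate(9, 29))   # P(history | B) = 10/31 ≈ 0.323
```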
Numerical Approaches
- In the TFIDF vector space framework, we use cosine similarity as a measure of document similarity.
- However, the same vector representation allows documents to be considered as points in a metric space.
- That is, given a set of points, the objective is to find a surface that divides the space in two parts, so that the points that fall in each part belong to a single class.
- Linear regression is the most popular approach based on this idea.
..Numerical Approaches..
- Linear regression is a standard technique for numerical prediction.
- It works naturally with numerical attributes (including the class).
- The predicted class value C is computed as a linear combination of the attribute values xi as follows:

  C = w0 + w1 x1 + w2 x2 + · · · + wn xn
..Numerical Approaches..
- The objective is to find the coefficients wi, given a number of training instances xi with their class values C.
- There are several approaches to using linear regression for classification (predicting class labels).
- One simple approach to binary classification is to substitute the class labels with the values −1 and 1.
- The predicted class is then determined by the sign of the linear combination.
- For example, consider our six-attribute document vectors (Table 5.1), and let us use −1 for class A and 1 for class B.
..Numerical Approaches..
- The task is then to find seven coefficients w0, w1, . . . , w6 that satisfy a system of 19 linear equations, one per labeled training document; the system is overdetermined, so it is solved approximately (a least-squares sketch follows).
- For the Theatre vector the result is positive, and thus the class predicted for Theatre is B, which also agrees with the prediction of 1-NN.
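A least-squares sketch of this scheme; the actual 19 x 6 matrix of TFIDF vectors is in Table 5.1 (shown only as an image in the deck), so X and y below stand in for it:

```python
# Linear regression as a binary classifier: encode class A as -1 and class B
# as +1, solve for w = (w0, ..., w6) by least squares, classify by sign(w.x).
import numpy as np

def fit_linear_classifier(X, y):
    # Prepend a column of ones so that w[0] acts as the intercept w0.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict(w, x):
    value = w[0] + np.dot(w[1:], x)
    return "B" if value > 0 else "A"   # a positive result means class B
```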
RELATIONAL LEARNING
- All classification methods that we have discussed so far are based solely on the document content, and more specifically on the bag-of-words model.
- Many additional document features, such as the internal HTML structure, language structure, and interdocument link structure, are ignored, although all of these may be a valuable source of information for the classification task.
- The basic problem with incorporating this information into the classification algorithm is the need for a uniform representation.
..RELATIONAL LEARNING..
- Relational learning extends the content-based approach to a relational representation.
- It allows various types of information to be represented in a uniform way and used for web document classification.
- In our domain we have documents d and terms t connected with the basic relation contains.
- That is, if term t occurs in document d, the relation contains(d, t) is true.
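A minimal illustration of what the uniform relational representation looks like, with relations stored as sets of tuples (the document and term names here are invented):

```python
# contains(d, t) holds iff the pair (d, t) is in the set.
contains = {
    ("Anthropology", "science"),
    ("Criminal Justice", "history"),
    ("Theatre", "art"),
}

# Other kinds of information, e.g. the interdocument link structure,
# fit the same representation as further relations.
links_to = {("Theatre", "Anthropology")}

print(("Theatre", "art") in contains)   # True: contains(Theatre, art) holds
```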
