# Data.Mining.C.6(II).classification and prediction

• The smaller the distance between two points, the more similar they are

1. 1. <ul><li>Chapter 6 (Part II) </li></ul><ul><li>Alternative Classification Technologies </li></ul>
2. 2. Chapter 6 (II) Alternative Classification Technologies <ul><li>- Instance-based Approach </li></ul><ul><li>- Ensemble Approach </li></ul><ul><li>- Co-training Approach </li></ul><ul><li>- Partially Supervised Approach </li></ul>
3. 3. Instance-Based Approach <ul><li>Store the training records </li></ul><ul><li>Use the training records directly to predict the class label of unseen cases </li></ul>
4. 4. Instance-Based Method <ul><li>Typical approach </li></ul><ul><ul><li>k-nearest neighbor approach (kNN) </li></ul></ul><ul><ul><ul><li>Instances are represented as points in a Euclidean space. </li></ul></ul></ul><ul><ul><ul><li>Uses the k “closest” points (nearest neighbors) to perform classification </li></ul></ul></ul>
5. 5. Nearest Neighbor Classifiers <ul><li>Basic idea: </li></ul><ul><ul><li>If it walks like a duck, quacks like a duck, then it’s probably a duck </li></ul></ul>[Figure: compute the distance (similarity) between the test record and the training records, then choose the k “nearest” (i.e., most similar) records]
6. 6. Nearest-Neighbor Classifiers <ul><li>Requires three things </li></ul><ul><ul><li>The set of stored records </li></ul></ul><ul><ul><li>A metric to compute the distance between records </li></ul></ul><ul><ul><li>The value of k, the number of nearest neighbors to retrieve </li></ul></ul><ul><li>To classify an unknown record: </li></ul><ul><ul><li>Compute its distance to the training records </li></ul></ul><ul><ul><li>Identify the k nearest neighbors </li></ul></ul><ul><ul><li>Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by majority vote) </li></ul></ul>
7. 7. Definition of Nearest Neighbor: the k-nearest neighbors of a record x are the data points that have the k smallest distances to x
8. 8. Key to kNN Approach <ul><li>Compute how near two points are to each other </li></ul><ul><li>- Similarity (closeness) measure </li></ul><ul><li>Determine the class from the nearest neighbor list </li></ul><ul><ul><li>Take the majority vote of class labels among the k nearest neighbors </li></ul></ul>
9. 9. Distance-based Similarity Measure <ul><li>Distances are normally used to measure the similarity between two data objects </li></ul><ul><li>Euclidean distance: $d(i,j) = \sqrt{\sum_{f=1}^{p} (x_{if} - x_{jf})^2}$, where $x_{if}$ is the value of variable $f$ for object $i$ and $p$ is the number of variables </li></ul><ul><ul><li>Properties </li></ul></ul><ul><ul><ul><li>$d(i,j) \geq 0$ </li></ul></ul></ul><ul><ul><ul><li>$d(i,i) = 0$ </li></ul></ul></ul><ul><ul><ul><li>$d(i,j) = d(j,i)$ </li></ul></ul></ul><ul><ul><ul><li>$d(i,j) \leq d(i,k) + d(k,j)$ </li></ul></ul></ul>
10. 10. Boolean type <ul><li>A contingency table for binary data (state value: 0 or 1), with object i on one axis and object j on the other </li></ul><ul><li>Simple matching coefficient </li></ul>
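The slide names the simple matching coefficient without spelling it out. One common way to write it as a dissimilarity is sketched here. The count names a, b, c, d are assumptions introduced for illustration and are not taken from the slide.

```latex
% Assumed contingency counts over the binary variables of objects i and j:
%   a: both are 1,  b: i is 1 and j is 0,  c: i is 0 and j is 1,  d: both are 0
d(i,j) = \frac{b + c}{a + b + c + d}
```

Under this reading, two objects that agree on every binary variable have distance 0, and two that disagree on every variable have distance 1.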
11. 11. Distance-based Measure for Categorical Type of Data <ul><li>Categorical type </li></ul><ul><li>e.g., red, yellow, blue, green for the nominal variable color </li></ul><ul><li>Method: Simple matching, $d(i,j) = \frac{p - m}{p}$ </li></ul><ul><ul><li>m : # of matches, p : total # of variables </li></ul></ul>
12. 12. Distance-based Measure for Mixed Types of Data <ul><li>An object (tuple) may contain all the types mentioned above </li></ul><ul><li>May use a weighted formula to combine their effects: $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$ </li></ul><ul><ul><li>Given p variables of different types: </li></ul></ul><ul><ul><li>$\delta_{ij}^{(f)} = 0$ if $x_{if} = x_{jf} = 0$ (or a value is missing); </li></ul></ul><ul><ul><li>$\delta_{ij}^{(f)} = 1$ otherwise </li></ul></ul><ul><ul><li>f is boolean or categorical: </li></ul></ul><ul><ul><ul><li>$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, otherwise $d_{ij}^{(f)} = 1$ </li></ul></ul></ul><ul><ul><li>f is interval-valued: use the normalized distance </li></ul></ul>
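A minimal sketch of the mixed-type measure, read as the average of per-variable distances over the variables that count for a given pair. Several details are assumptions: the kind labels, the `ranges` argument, applying the "both zero" exclusion only to boolean variables, and normalizing interval values by the variable's range.

```python
def mixed_dissimilarity(xi, xj, kinds, ranges):
    """Dissimilarity between two records with mixed variable types.

    kinds[f] is "boolean", "categorical", or "interval" (hypothetical labels).
    ranges[f] is max - min of variable f; it is read only for interval variables.
    """
    total = 0.0
    counted = 0
    for f, kind in enumerate(kinds):
        a, b = xi[f], xj[f]
        if a is None or b is None:
            continue  # a missing value excludes the variable for this pair
        if kind == "boolean" and a == 0 and b == 0:
            continue  # assumption: a joint 0 on a boolean variable is not counted
        if kind == "interval":
            # assumption: normalized distance = absolute difference / range
            d = abs(a - b) / ranges[f] if ranges[f] else 0.0
        else:
            d = 0.0 if a == b else 1.0
        total += d
        counted += 1
    return total / counted if counted else 0.0


kinds = ("boolean", "categorical", "interval")
ranges = (None, None, 40.0)
distance = mixed_dissimilarity((1, "red", 10.0), (0, "red", 20.0), kinds, ranges)
```

In the example the three variables contribute 1, 0 and 0.25, so the result is their mean.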
13. 13. K-Nearest Neighbor Algorithm <ul><li>Input: Let k be the number of nearest neighbors and D be the set of training examples </li></ul><ul><li>For each test example $z = (x', y')$ do </li></ul><ul><li>Compute $d(x', x)$, the distance between z and every example $(x, y) \in D$ </li></ul><ul><li>Select $D_z \subseteq D$, the set of k closest training examples to z </li></ul><ul><li>Assign y' the majority class label among the examples in $D_z$ </li></ul><ul><li>End for </li></ul>
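The loop above can be sketched in a few lines of Python. This is an illustration under assumptions, not the deck's own code: the function names are hypothetical, Euclidean distance is the default metric, and ties in the vote are broken by whichever label was counted first.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train, x_new, k, distance=euclidean):
    """Classify x_new by majority vote among its k nearest training examples.

    train is a list of (x, y) pairs, where x is a tuple of numbers and y a label.
    """
    # Sort the stored records by distance to x_new and keep the k closest.
    neighbors = sorted(train, key=lambda rec: distance(rec[0], x_new))[:k]
    votes = Counter(y for _, y in neighbors)
    return votes.most_common(1)[0][0]


train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((4.0, 4.2), "B"), ((4.1, 3.9), "B"),
]
label_near_a = knn_classify(train, (1.1, 1.0), k=3)
label_near_b = knn_classify(train, (4.0, 4.0), k=3)
```

Every prediction scans all stored records, which is the cost the deck attributes to lazy learners.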
14. 14. Measure for Other Types of Data <ul><li>Textual Data: Vector Space Representation </li></ul><ul><li>A document is represented as a vector: </li></ul><ul><ul><ul><li>(W1, W2, …, Wn) </li></ul></ul></ul><ul><ul><li>Binary: </li></ul></ul><ul><ul><ul><li>Wi = 1 if the corresponding term i (often a word) is in the document </li></ul></ul></ul><ul><ul><ul><li>Wi = 0 if the term i is not in the document </li></ul></ul></ul><ul><ul><li>TF (Term Frequency): </li></ul></ul><ul><ul><ul><li>Wi = tfi, where tfi is the number of times term i occurs in the document </li></ul></ul></ul>
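As a small illustration of the two weighting schemes, the sketch below builds a binary vector and a term-frequency vector for one document. The vocabulary, the sample sentence and the function names are made up for the example, and tokenization is a plain whitespace split.

```python
from collections import Counter


def binary_vector(tokens, vocabulary):
    """Wi = 1 if term i occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


def tf_vector(tokens, vocabulary):
    """Wi = number of times term i occurs in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


vocabulary = ["data", "mining", "duck"]
document = "data mining finds patterns in data".split()
binary_weights = binary_vector(document, vocabulary)
tf_weights = tf_vector(document, vocabulary)
```

Terms outside the vocabulary are ignored, and a real vocabulary usually has thousands of entries with most weights at zero.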
15. 15. Similarity Measure for Textual Data <ul><li>Distance-based Similarity Measure </li></ul><ul><li>Problems: high dimensionality and data sparseness </li></ul>
16. 16. Other Similarity Measure <ul><li>The “closeness” between documents is calculated as the correlation between the vectors that represent them, using measures such as the cosine of the angle between the two vectors. </li></ul>Cosine measure, for document vectors $d_1$ and $d_2$: $\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}$
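The cosine of the angle between two document vectors is the dot product divided by the product of the vector lengths. A minimal sketch follows. The function name and the sample vectors are hypothetical, and returning 0.0 for an all-zero vector is an assumption made to avoid division by zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_u * norm_v)


doc_a = [2, 1, 0]  # term-frequency vectors over a shared vocabulary
doc_b = [4, 2, 0]  # same direction as doc_a, twice as long
doc_c = [0, 0, 3]  # shares no terms with doc_a
```

Because the measure depends only on direction, `doc_a` and `doc_b` count as maximally similar even though one document is twice as long. This is one reason the measure suits sparse, high-dimensional text vectors.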
17. 17. Discussion on the k-NN Algorithm <ul><li>k-NN classifiers are lazy learners (or learning from your neighbors) </li></ul><ul><ul><li>They do not build models explicitly </li></ul></ul><ul><ul><li>Unlike eager learners such as decision tree induction </li></ul></ul><ul><ul><li>Classifying unknown records is relatively expensive </li></ul></ul><ul><ul><li>Robust to noisy data by averaging over the k nearest neighbors </li></ul></ul>
18. 18. Chapter 6 (II) Alternative Classification Technologies <ul><li>- Instance-based Approach </li></ul><ul><li>- Ensemble Approach </li></ul><ul><li>- Co-training Approach </li></ul><ul><li>- Partially Supervised Approach </li></ul>
19. 19. Ensemble Methods <ul><li>Construct a set of classifiers from the training data </li></ul><ul><li>Predict class label of previously unseen records by aggregating predictions made by multiple classifiers </li></ul>
20. 20. General Idea
21. 21. Examples of Ensemble Approaches <ul><li>How to generate an ensemble of classifiers? </li></ul><ul><ul><li>Bagging </li></ul></ul><ul><ul><li>Boosting </li></ul></ul>
22. 22. Bagging <ul><li>Sampling with replacement </li></ul><ul><li>Build a classifier on each bootstrap sample set </li></ul>
23. 23. Bagging Algorithm <ul><li>Let k be the number of bootstrap sample sets </li></ul><ul><li>For i = 1 to k do </li></ul><ul><li>Create a bootstrap sample $D_i$ of size N </li></ul><ul><li>Train a (base) classifier $C_i$ on $D_i$ </li></ul><ul><li>End for </li></ul>
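A minimal sketch of the bagging loop, with a majority vote added to combine the base classifiers as the deck describes for ensembles in general. The one-dimensional decision stump is a hypothetical base learner chosen only to keep the example self-contained. The function names, the toy data and the fixed seed are assumptions too.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) records from data with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


def bagging(data, train_base, k, seed=0):
    """Train k base classifiers, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train_base(bootstrap_sample(data, rng)) for _ in range(k)]


def predict_majority(classifiers, x):
    """Combine the base predictions by majority vote."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]


def train_stump(sample):
    """Hypothetical base learner: the single threshold with the fewest errors."""
    best = None
    for threshold, _ in sample:
        for left, right in ((-1, 1), (1, -1)):
            errors = sum((left if x <= threshold else right) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right


data = [(0.1, -1), (0.2, -1), (0.3, -1), (0.7, 1), (0.8, 1), (0.9, 1)]
ensemble = bagging(data, train_stump, k=11)
```

An odd k avoids tied votes in a two-class problem.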
24. 24. Boosting <ul><li>An iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records </li></ul><ul><ul><li>Initially, all N records are assigned equal weights </li></ul></ul><ul><ul><li>Unlike bagging, weights may change at the end of each boosting round </li></ul></ul>
25. 25. Boosting <ul><li>Records that are wrongly classified will have their weights increased </li></ul><ul><li>Records that are classified correctly will have their weights decreased </li></ul><ul><li>Example 4 is hard to classify </li></ul><ul><li>Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds </li></ul>
26. 26. Boosting [Figure: the process of generating classifiers C1, C2, …, Cm from the training sets D1, D2, …, Dm]
27. 27. Boosting <ul><li>Problems: </li></ul><ul><li>How to update the weights of the training examples? </li></ul><ul><li>How to combine the predictions made by each base classifier? </li></ul>
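28. 28. AdaBoosting Algorithm <ul><li>Given $(x_j, y_j)$: a set of N training examples (j = 1, …, N) </li></ul>The error rate of a base classifier $C_i$ is the weighted share of the training examples that it misclassifies, counted with the indicator I(p), where I(p) = 1 if p is true and 0 otherwise. The importance of a classifier $C_i$ is computed from its error rate.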
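29. 29. AdaBoosting Algorithm The weight update mechanism (Equation): the weight of example $(x_i, y_i)$ during a round is rescaled according to whether that round's classifier labeled the example correctly, then divided by a normalization factor so that the updated weights sum to 1.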
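The formulas on these two slides were images and did not survive extraction. For orientation, one widely used textbook form of the AdaBoost quantities is sketched here. It is an assumption that the deck uses exactly this form. Here i indexes the boosting round, j indexes the example, and the weights are assumed to sum to 1 (some presentations carry an extra 1/N factor in the error rate).

```latex
% Weighted error rate of base classifier C_i
\epsilon_i = \sum_{j=1}^{N} w_j^{(i)} \, I\big(C_i(x_j) \neq y_j\big)

% Importance of classifier C_i
\alpha_i = \frac{1}{2} \ln\!\left(\frac{1 - \epsilon_i}{\epsilon_i}\right)

% Weight update for example j after round i, with Z_i the normalization factor
w_j^{(i+1)} = \frac{w_j^{(i)}}{Z_i} \times
\begin{cases}
  e^{-\alpha_i} & \text{if } C_i(x_j) = y_j \\
  e^{\alpha_i}  & \text{if } C_i(x_j) \neq y_j
\end{cases}
```

With this form, a classifier with an error rate near 0 gets a large positive importance, and one with an error rate of 0.5 gets an importance of 0.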
30. 30. AdaBoosting Algorithm <ul><li>Let k be the number of boosting rounds and D be the set of all examples </li></ul><ul><li>Initialize the weights W for all N examples </li></ul><ul><li>For i = 1 to k do </li></ul><ul><li>Create training set $D_i$ by sampling from D according to W </li></ul><ul><li>Train a base classifier $C_i$ on $D_i$ </li></ul><ul><li>Apply $C_i$ to all examples in the original set D </li></ul><ul><li>Update the weight of each example according to the Equation </li></ul><ul><li>End for </li></ul>
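The rounds above can be sketched as follows. This is an illustration under assumptions rather than the deck's exact procedure: it uses a common textbook form of the error rate, importance and weight update, resets the weights when a round's error exceeds 0.5, clamps a zero error to keep the logarithm finite, and uses a hypothetical one-dimensional stump as base learner with labels in {-1, +1}.

```python
import math
import random


def train_stump(sample):
    """Hypothetical base learner: the single threshold with the fewest errors."""
    best = None
    for threshold, _ in sample:
        for left, right in ((-1, 1), (1, -1)):
            errors = sum((left if x <= threshold else right) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right


def adaboost(data, train_base, k, seed=0):
    """Return a list of (importance, classifier) pairs after k boosting rounds."""
    rng = random.Random(seed)
    n = len(data)
    weights = [1.0 / n] * n  # initially all examples are equally weighted
    ensemble = []
    for _ in range(k):
        sample = rng.choices(data, weights=weights, k=n)  # sample according to W
        clf = train_base(sample)
        wrong = [clf(x) != y for x, y in data]  # apply to the original set D
        eps = sum(w for w, bad in zip(weights, wrong) if bad)
        if eps > 0.5:
            weights = [1.0 / n] * n  # assumption: reset the weights and move on
            continue
        eps = max(eps, 1e-10)  # assumption: avoid log of infinity when eps is 0
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, clf))
        weights = [w * math.exp(alpha if bad else -alpha) for w, bad in zip(weights, wrong)]
        z = sum(weights)  # normalization factor
        weights = [w / z for w in weights]
    return ensemble


def predict(ensemble, x):
    """Importance-weighted vote of the base classifiers."""
    score = sum(alpha * clf(x) for alpha, clf in ensemble)
    return 1 if score >= 0 else -1


data = [(0.1, -1), (0.2, -1), (0.3, -1), (0.7, 1), (0.8, 1), (0.9, 1)]
model = adaboost(data, train_stump, k=10)
```

The weighted vote in `predict` is one answer to the deck's question of how to combine the base classifiers: each vote counts in proportion to the classifier's importance.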
31. 31. Increasing Classifier Accuracy <ul><li>Bagging and Boosting: </li></ul><ul><li>- general techniques for improving classifier accuracy </li></ul><ul><li>Combining a series of T learned classifiers C1, …, CT with the aim of creating an improved composite classifier C* </li></ul>[Figure: classifiers C1, C2, …, CT are learned from the data; their votes on a new data sample are combined into a class prediction]
32. 32. Chapter 6 (II) Alternative Classification Technologies <ul><li>- Instance-based Approach </li></ul><ul><li>- Ensemble Approach </li></ul><ul><li>- Co-training Approach </li></ul><ul><li>- Partially Supervised Approach </li></ul>
33. 33. Unlabeled Data <ul><li>One of the bottlenecks of classification is the labeling of a large set of examples (data records or text documents). </li></ul><ul><ul><li>Often done manually </li></ul></ul><ul><ul><li>Time consuming </li></ul></ul><ul><li>Can we label only a small number of examples and make use of a large amount of unlabeled data for classification? </li></ul>
34. 34. Co-training Approach <ul><li>Blum and Mitchell (CMU, 1998) </li></ul><ul><ul><li>Two “independent” views: split the features into two sets. </li></ul></ul><ul><ul><li>Train a classifier on each view. </li></ul></ul><ul><ul><li>Each classifier labels data that can then be used to train the other classifier </li></ul></ul>
35. 35. Co-Training Approach [Figure: the feature set X = (X1, X2) is split into subset X1 and subset X2; classification model one is trained on X1 and classification model two on X2, both from the example set L; each model classifies unlabeled data, producing new labeled data sets 1 and 2 that are fed back into training]
36. 36. Two views <ul><li>Features can be split into two independent sets (views): </li></ul><ul><ul><li>The instance space: $X = X_1 \times X_2$ </li></ul></ul><ul><ul><li>Each example: $x = (x_1, x_2)$ </li></ul></ul><ul><li>A pair of views $x_1$, $x_2$ satisfies view independence if and only if: </li></ul><ul><li>$\Pr[X_1 = x_1 \mid X_2 = x_2, Y = y] = \Pr[X_1 = x_1 \mid Y = y]$ </li></ul><ul><li>$\Pr[X_2 = x_2 \mid X_1 = x_1, Y = y] = \Pr[X_2 = x_2 \mid Y = y]$ </li></ul>
37. 37. Co-training algorithm For instance, p=1, n=3, k=30, and u=75
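The algorithm itself was shown as an image. A hedged sketch of the usual co-training loop follows. It reads the parameters as: p positive and n negative examples labeled per classifier per round, k rounds, and u examples in the working pool of unlabeled data. That reading, the `train` interface and the linear scorer are assumptions made for illustration.

```python
import random


def co_training(labeled, unlabeled, train, p=1, n=3, k=30, u=75, seed=0):
    """Grow the labeled set using two classifiers, one per view.

    labeled: list of ((x1, x2), y) with y in {0, 1}; unlabeled: list of (x1, x2).
    train(examples) takes [(view_value, y), ...] and returns a scoring function
    where a higher score means "more likely positive" (hypothetical interface).
    """
    rng = random.Random(seed)
    labeled = list(labeled)
    rest = list(unlabeled)
    rng.shuffle(rest)
    pool, rest = rest[:u], rest[u:]  # working pool of u unlabeled examples
    for _ in range(k):
        if not pool:
            break
        # Train one classifier per view on the current labeled set.
        scorers = [train([(x[v], y) for x, y in labeled]) for v in (0, 1)]
        for v, score in enumerate(scorers):
            ranked = sorted(pool, key=lambda x: score(x[v]), reverse=True)
            positives = ranked[:p]  # most confidently positive
            negatives = ranked[p:][-n:] if n > 0 else []  # most confidently negative
            labeled += [(x, 1) for x in positives] + [(x, 0) for x in negatives]
            pool = ranked[p:len(ranked) - len(negatives)]
        refill = 2 * p + 2 * n  # replenish the pool from the remaining data
        pool, rest = pool + rest[:refill], rest[refill:]
    return labeled


def train_linear(examples):
    """Hypothetical scorer: position between the negative and positive means."""
    pos = [v for v, y in examples if y == 1]
    neg = [v for v, y in examples if y == 0]
    pos_mean, neg_mean = sum(pos) / len(pos), sum(neg) / len(neg)
    return lambda v: (v - neg_mean) / (pos_mean - neg_mean)


seed_labels = [((1.0, 1.0), 1), ((0.0, 0.0), 0)]
unlabeled = [(1.0 + 0.01 * i, 1.0 - 0.01 * i) for i in range(20)]
unlabeled += [(0.01 * i, -0.01 * i) for i in range(20)]
grown = co_training(seed_labels, unlabeled, train_linear, p=1, n=1, k=2, u=40)
```

In the toy run each view separates the classes on its own, so every label the classifiers add is correct. With real data, wrong self-assigned labels can accumulate, which is why only the most confident predictions are added.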
38. 38. Co-training: Example <ul><li>Ca. 1050 web pages from 4 CS depts </li></ul><ul><ul><li>pages (25%) as test data </li></ul></ul><ul><ul><li>The remaining 75% of pages </li></ul></ul><ul><ul><ul><li>Labeled data: 3 positive and 9 negative examples </li></ul></ul></ul><ul><ul><ul><li>Unlabeled data: the rest (ca. 770 pages) </li></ul></ul></ul><ul><li>Manually labeled into a number of categories: e.g., “course home page”. </li></ul><ul><li>Two views: </li></ul><ul><ul><li>View #1 (page-based): words in the page </li></ul></ul><ul><ul><li>View #2 (hyperlink-based): words in the hyperlinks </li></ul></ul><ul><li>Naïve Bayes Classifier </li></ul>
39. 39. Co-training: Experimental Results <ul><li>begin with 12 labeled web pages (course, etc) </li></ul><ul><li>provide ca. 1,000 additional unlabeled web pages </li></ul><ul><li>average error: traditional approach 11.1%; </li></ul><ul><li>average error: co-training 5.0% </li></ul>
40. 40. Chapter 6 (II) Alternative Classification Technologies <ul><li>- Instance-based Approach </li></ul><ul><li>- Ensemble Approach </li></ul><ul><li>- Co-training Approach </li></ul><ul><li>- Partially Supervised Approach </li></ul>
41. 41. Learning from Positive & Unlabeled Data <ul><li>Positive examples : a set P of examples of a class </li></ul><ul><li>Unlabeled set : a set U of unlabeled (or mixed) examples, with instances from P and also instances not from P ( negative examples ). </li></ul><ul><li>Build a classifier : Build a classifier to classify the examples in U and/or future (test) data. </li></ul><ul><li>Key feature of the problem : no labeled negative training data. </li></ul><ul><li>We call this problem PU-learning . </li></ul>
42. 42. Positive and Unlabeled
43. 43. Direct Marketing <ul><li>A company has a database with details of its customers (positive examples), but no information on people who are not its customers, i.e., no negative examples . </li></ul><ul><li>It wants to find people who are similar to its customers for marketing. </li></ul><ul><li>It buys a database consisting of details of people, some of whom may be potential customers. </li></ul>
44. 44. Novel 2-step strategy <ul><li>Step 1: Identifying a set of reliable negative documents from the unlabeled set. </li></ul><ul><li>Step 2: Building a sequence of classifiers by iteratively applying a classification algorithm and then selecting a good classifier. </li></ul>
45. 45. Two-Step Process
46. 46. Existing 2-step strategy [Figure: Step 1 extracts the reliable negative set RN from the unlabeled set U, leaving Q = U - RN, alongside the positive set P; Step 2 either uses P, RN and Q to build the final classifier iteratively, or uses only P and RN to build a classifier]
47. 47. Step 1: The Spy technique <ul><li>Sample a certain % of the positive examples and put them into the unlabeled set to act as “spies”. </li></ul><ul><li>Run a classification algorithm (e.g., Naïve Bayes) assuming all unlabeled examples are negative. </li></ul><ul><ul><li>Through the “spies”, we learn how the actual positive examples in the unlabeled set behave. </li></ul></ul><ul><li>We can then extract reliable negative examples from the unlabeled set more accurately. </li></ul>
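A hedged sketch of the spy step. The slide does not say how the spies' behavior turns into a cutoff. The sketch assumes the cutoff is the lowest score any spy receives, so an unlabeled example counts as reliably negative only if it scores below every spy. The 15% spy ratio, the `train` interface and the linear scorer are also assumptions.

```python
import random


def spy_reliable_negatives(positives, unlabeled, train, spy_ratio=0.15, seed=0):
    """Return (reliable_negatives, threshold) using positive examples as spies.

    train(examples) takes [(x, y), ...] with y in {0, 1} and returns a scoring
    function where a higher score means "more likely positive" (hypothetical).
    """
    rng = random.Random(seed)
    n_spies = max(1, round(spy_ratio * len(positives)))
    spy_idx = set(rng.sample(range(len(positives)), n_spies))
    spies = [x for i, x in enumerate(positives) if i in spy_idx]
    kept = [x for i, x in enumerate(positives) if i not in spy_idx]
    # Treat every unlabeled example, spies included, as negative.
    score = train([(x, 1) for x in kept] + [(x, 0) for x in unlabeled + spies])
    # Assumption: the lowest-scoring spy sets the cutoff.
    threshold = min(score(s) for s in spies)
    reliable = [x for x in unlabeled if score(x) < threshold]
    return reliable, threshold


def train_linear(examples):
    """Hypothetical scorer: position between the negative and positive means."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    pos_mean, neg_mean = sum(pos) / len(pos), sum(neg) / len(neg)
    return lambda x: (x - neg_mean) / (pos_mean - neg_mean)


positives = [1.0 + 0.01 * i for i in range(20)]
true_negatives = [0.01 * i for i in range(30)]
hidden_positives = [1.25, 1.3]  # positives hiding in the unlabeled set
unlabeled = true_negatives + hidden_positives
reliable, threshold = spy_reliable_negatives(positives, unlabeled, train_linear)
```

In this toy run the hidden positives score at least as high as the spies, so they stay out of the reliable negative set.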
48. 48. Step 2: Running a classification algorithm iteratively <ul><li>Run a classification algorithm iteratively using P, RN and Q </li></ul><ul><ul><li>Stop when no document from Q can be classified as negative. RN and Q are updated in each iteration </li></ul></ul>
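One simple way to realize this loop is sketched below. It is an assumption-laden variant: each round trains on P and RN only, moves every document in Q that scores below a cutoff into RN, and stops when nothing moves. The `train` interface, the 0.5 cutoff and the linear scorer are hypothetical, and the deck's final "select a good classifier" step is left out.

```python
def iterate_negatives(positives, reliable_neg, remaining, train, cutoff=0.5):
    """Repeatedly move documents classified as negative from Q into RN.

    Returns the last scoring function together with the final RN and Q.
    """
    rn = list(reliable_neg)
    q = list(remaining)
    while True:
        score = train([(x, 1) for x in positives] + [(x, 0) for x in rn])
        new_negatives = [x for x in q if score(x) < cutoff]
        if not new_negatives:
            return score, rn, q  # nothing in Q is classified as negative
        rn += new_negatives
        q = [x for x in q if score(x) >= cutoff]


def train_linear(examples):
    """Hypothetical scorer: 0 at the negative mean, 1 at the positive mean."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    pos_mean, neg_mean = sum(pos) / len(pos), sum(neg) / len(neg)
    return lambda x: (x - neg_mean) / (pos_mean - neg_mean)


P = [1.0 + 0.01 * i for i in range(20)]
RN = [0.0, 0.05]
Q = [0.1, 0.2, 0.4, 1.1, 1.2]
final_score, final_rn, final_q = iterate_negatives(P, RN, Q, train_linear)
```

The loop always terminates because Q shrinks in every round that does not stop.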
49. 49. PU-Learning <ul><li>Heuristic methods: </li></ul><ul><ul><li>Step 1 tries to find some initial reliable negative examples from the unlabeled set. </li></ul></ul><ul><ul><li>Step 2 tries to identify more and more negative examples iteratively. </li></ul></ul><ul><li>The two steps together form an iterative strategy </li></ul>