# Data Mining and Machine Learning
1. Data Mining and Machine Learning. Yen-Jen Oyang, Dept. of Computer Science and Information Engineering.
2. Reference Books
   - "Data Mining" by Ian Witten and Eibe Frank.
   - "Data Mining" by Jiawei Han and Micheline Kamber.
3. Observations and Challenges in the Information Age
   - A huge volume of information has been, and is being, digitized and stored in computers.
   - Because of this volume, effective exploitation of the information is beyond human capability without the aid of intelligent computer software.
4. An Example of Data Mining
   - Given the data set shown on the next slide, can we figure out a set of rules that predicts the classes of objects?
5. Data Set
   [Table of 24 training samples, each a two-dimensional point (x, y) labeled with class "O" or "×"; the three-column table layout was lost in extraction.]
6. Distribution of the Data Set
   [Scatter plot of the training samples in the x-y plane, with "O"s and "×"s plotted; axis ticks at 10, 15, 20, and 30.]
7. Rule Based on Observation
8. Identifying the Boundary of Different Classes of Objects
9. Boundary Identified
10. Data Mining / Knowledge Discovery
    - The main theme of data mining is to discover unknown, implicit knowledge in a large data set.
    - There are three main categories of data mining algorithms:
      - classification;
      - clustering;
      - association rule mining / correlation analysis.
11. Data Classification
    - In a data classification problem, each object is described by a set of attribute values, and each object belongs to one of a set of predefined classes.
    - The goal is to derive, from a given set of training samples, a set of rules that predicts which class a new object belongs to. Data classification is also called supervised learning.
12. Applications of Data Classification
    - One example: a bank wants an automatic mechanism that decides whether a credit card application should be approved, based on existing customers' records.
    - Another example: a hospital wants to determine whether a new patient belongs to the high-risk group for a particular disease, based on the patient's health record.
13. An Example of a Data Classification Application

    | Education   | Annual Income | Age    | Own House | Sex    | Credit Rating (class) |
    |-------------|---------------|--------|-----------|--------|-----------------------|
    | College     | High          | Old    | Yes       | Male   | Good                  |
    | High school | Low           | Middle | No        | Male   | Good                  |
    | High school | Middle        | Young  | Yes       | Female | Poor                  |
    | College     | High          | Old    | Yes       | Male   | Poor                  |
    | High school | High          | Old    | Yes       | Female | Good                  |
    | College     | Middle        | Young  | No        | Female | Good                  |
    | High school | Low           | Middle | No        | Male   | Poor                  |
    | College     | Middle        | Middle | -----     | Female | Good                  |
    | High school | Middle        | Young  | No        | Male   | Poor                  |
14. The rule derived is as follows:
    - If (education = high school) and ~(income = high), then credit rating = poor.
    - Otherwise, credit rating = good.

    Most of the time, the rules derived are not perfect; in other words, mispredictions are unavoidable in most cases. In this example, the accuracy is 7/9 ≈ 78%.
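The derived rule can be checked mechanically against the table on slide 13. A minimal sketch in Python (the rows are transcribed from the slide, keeping only the two attributes the rule uses):

```python
# Each sample: (education, income, actual credit rating).
samples = [
    ("College", "High", "Good"),
    ("High school", "Low", "Good"),
    ("High school", "Middle", "Poor"),
    ("College", "High", "Poor"),
    ("High school", "High", "Good"),
    ("College", "Middle", "Good"),
    ("High school", "Low", "Poor"),
    ("College", "Middle", "Good"),
    ("High school", "Middle", "Poor"),
]

def predict(education, income):
    # If (education = high school) and ~(income = high), then poor; else good.
    if education == "High school" and income != "High":
        return "Poor"
    return "Good"

correct = sum(predict(e, i) == c for e, i, c in samples)
print(f"accuracy = {correct}/{len(samples)}")  # accuracy = 7/9
```

Two rows are mispredicted (a "Good" high-school/low-income customer and a "Poor" college/high-income customer), which is exactly the 7/9 figure the slide reports.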
15. Representation and Inference of Knowledge
    - Knowledge represented in an interpretable form, such as rules, is one of the most important outputs of data classification software.
    - Some classification algorithms, e.g. neural networks and support vector machines, may perform well in prediction/classification but do not output knowledge or rules.
16. In some data classification applications, we are not concerned with the knowledge on which the decisions are based. For example, a credit card company may want an automatic mechanism that determines the credit limits of new applicants. In many other applications, however, it is of interest to learn the knowledge and even to conduct inference with it.
17. Rule Generated by an RBF-Network-Based Learning Algorithm for the Previous Example
    - Let [formula lost in extraction] and [formula lost in extraction].
    - If [condition lost in extraction], then prediction = "O".
    - Otherwise, prediction = "X".
18. Values associated with each training sample:

    | Data (x, y) | Value | Data (x, y) | Value |
    |-------------|-------|-------------|-------|
    | (15, 35)    | 1.973 | (18, 28)    | 5.451 |
    | (17, 34)    | 2.045 | (18, 39)    | 3.287 |
    | (16, 31)    | 1.794 | (13, 22)    | 6.260 |
    | (14, 32)    | 1.794 | (10, 34)    | 3.232 |
    | (13, 34)    | 1.794 | (19, 36)    | 3.587 |
    | (15, 30)    | 2.027 | (11, 38)    | 3.463 |
    | (12, 33)    | 1.794 | (9, 32)     | 4.562 |
    | (18, 32)    | 2.327 | (21, 28)    | 5.070 |
    | (11, 31)    | 2.745 | (23, 33)    | 5.322 |
    | (15, 33)    | 1.723 | (25, 18)    | 10.86 |
    | (13, 37)    | 2.939 | (8, 15)     | 10.08 |
    | (16, 38)    | 2.745 | (9, 23)     | 6.458 |
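The exact functions on slide 17 did not survive extraction, but a typical RBF-network-style rule scores each class by a sum of Gaussian kernels centred on that class's training samples and predicts the class with the larger score. A minimal sketch of that general idea (the sample coordinates and the kernel width `sigma` are illustrative assumptions, not the slide's actual data or parameters):

```python
import math

# Illustrative labelled samples, not the slide's actual data set.
O_samples = [(1, 1), (2, 1), (1, 2)]
X_samples = [(8, 8), (9, 8), (8, 9)]

def rbf_score(v, samples, sigma=2.0):
    # Sum of Gaussian radial basis functions centred on the samples.
    return sum(math.exp(-((v[0] - x) ** 2 + (v[1] - y) ** 2)
                        / (2 * sigma ** 2))
               for x, y in samples)

def classify(v):
    # Predict the class whose kernel-density score is larger.
    return "O" if rbf_score(v, O_samples) >= rbf_score(v, X_samples) else "X"

print(classify((1.5, 1.5)))  # O
print(classify((8.5, 8.5)))  # X
```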
19. Alternative Data Classification Algorithms
    - Decision trees (C4.5 and C5.0);
    - Instance-based learning (KNN);
    - Naïve Bayesian classifier;
    - Support vector machines (SVM);
    - Novel approaches, including the RBF-network-based classifier that we have recently proposed.
20. Accuracy of Different Classification Algorithms (%)

    | Data set (train, test)  | RBF   | SVM   | 1NN   | 3NN   |
    |-------------------------|-------|-------|-------|-------|
    | Satimage (4435, 2000)   | 92.30 | 91.30 | 89.35 | 90.6  |
    | Letter (15000, 5000)    | 97.12 | 97.98 | 95.26 | 95.46 |
    | Shuttle (43500, 14500)  | 99.94 | 99.92 | 99.91 | 99.92 |
    | Average                 | 96.45 | 96.40 | 94.84 | 95.33 |
21. Comparison of Execution Time (in seconds)

    | Task             | Data set | RBF without data reduction | RBF with data reduction | SVM    |
    |------------------|----------|----------------------------|-------------------------|--------|
    | Cross validation | Satimage | 670                        | 265                     | 64622  |
    |                  | Letter   | 2825                       | 1724                    | 386814 |
    |                  | Shuttle  | 96795                      | 59.9                    | 467825 |
    | Make classifier  | Satimage | 5.91                       | 0.85                    | 21.66  |
    |                  | Letter   | 17.05                      | 6.48                    | 282.05 |
    |                  | Shuttle  | 1745                       | 0.69                    | 129.84 |
    | Test             | Satimage | 21.3                       | 7.4                     | 11.53  |
    |                  | Letter   | 128.6                      | 51.74                   | 94.91  |
    |                  | Shuttle  | 996.1                      | 5.85                    | 2.13   |
22. More Insights

    |                                                      | Satimage | Letter | Shuttle |
    |------------------------------------------------------|----------|--------|---------|
    | # of training samples in the original data set       | 4435     | 15000  | 43500   |
    | # of training samples after data reduction is applied| 1815     | 7794   | 627     |
    | Classification accuracy after data reduction (%)     | 92.15    | 96.18  | 99.32   |
    | # of support vectors identified by LIBSVM            | 1689     | 8931   | 287     |
    | % of training samples remaining                      | 40.92%   | 51.96% | 1.44%   |
23. Instance-Based Learning
    - In instance-based learning, we take the k nearest training samples of a new instance (v₁, v₂, …, vₘ) and assign the new instance to the class that has the most instances among those k samples.
    - Classifiers that adopt instance-based learning are commonly called KNN classifiers.
24. Example of the KNN Classifiers
    - If a 1NN classifier is employed, the prediction for the test instance is "X".
    - If a 3NN classifier is employed, the prediction for the test instance is "O".
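A minimal KNN classifier as described on slide 23, with illustrative data chosen so that, as in the slide's example, the 1NN and 3NN predictions disagree (the slide's actual points did not survive extraction):

```python
import math
from collections import Counter

# Illustrative training samples: a lone "X" sits close to the test point,
# but the wider neighbourhood is dominated by "O"s.
train = [((1, 2), "O"), ((2, 3), "O"), ((2, 1), "O"),
         ((3.2, 3), "X"), ((6, 6), "X"), ((7, 5), "X")]

def knn_predict(v, k):
    # Take the k training samples nearest to v and return the majority class.
    nearest = sorted(train, key=lambda s: math.dist(v, s[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict((3, 3), 1))  # X — the single nearest sample is (3.2, 3)
print(knn_predict((3, 3), 3))  # O — two of the three nearest samples are "O"
```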
25. Data Clustering
    - Data clustering concerns how to group a set of objects based on the similarity of their attributes and/or their proximity in the vector space. Data clustering is also called unsupervised learning.
26. Applications of Data Clustering
    - One application is to cluster the customers of a bank so that the bank can provide services more effectively.
    - For example, the bank may find the following clusters among its customers:
      - aggressive investors;
      - conservative investors;
      - balanced investors.
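The bank example can be sketched with plain k-means, the simplest clustering algorithm. Both the customer data and the two features, here (fraction of assets in stocks, fraction in savings), are illustrative assumptions, not from the slides:

```python
import random

customers = [(0.90, 0.10), (0.80, 0.20), (0.85, 0.10),   # "aggressive"
             (0.10, 0.90), (0.20, 0.80), (0.15, 0.85),   # "conservative"
             (0.50, 0.50), (0.45, 0.55), (0.55, 0.45)]   # "balanced"

def kmeans(points, k, iters=20):
    # Lloyd's algorithm: assign each point to the nearest centroid,
    # then move each centroid to the mean of its cluster.
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: (p[0] - centroids[j][0]) ** 2
                                  + (p[1] - centroids[j][1]) ** 2)
            clusters[i].append(p)
        centroids = [(sum(x for x, _ in c) / len(c),
                      sum(y for _, y in c) / len(c)) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return clusters

for c in kmeans(customers, 3):
    print(sorted(c))
```

Note that k-means is unsupervised: it recovers the three groups from proximity alone, without ever seeing the "aggressive"/"conservative"/"balanced" labels.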
27. Related Challenging Issues
    - Two challenging issues associated with data clustering and classification are:
      - feature selection;
      - outlier detection.
28. Importance of Feature Selection
    - Including features that are not correlated with the classification decision may make the problem even more complicated.
    - For example, in the data set shown on the next slide, including the feature corresponding to the y-axis causes an incorrect prediction for the test instance if a 3NN classifier is employed.
29. It is apparent that the "o"s and "x"s are separated by the line x = 10. If only the attribute corresponding to the x-axis were selected, the 3NN classifier would predict the class of the test instance correctly. [The accompanying scatter plot was lost in extraction.]
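The effect described on slides 28 and 29 can be reproduced with hypothetical data (the slide's actual points were not preserved): the classes are separated by x = 10, but an irrelevant y feature misleads the 3NN classifier.

```python
import math
from collections import Counter

# Hypothetical samples: class is determined by x alone (O if x < 10),
# but the "X" samples happen to have y-values close to the test point's.
train = [((8, 9), "O"), ((9, 8), "O"), ((7, 7), "O"),
         ((11, 3), "X"), ((12, 3), "X"), ((13, 3), "X")]
test = (9.5, 3)  # true class "O", since its x < 10

def knn(v, samples, k):
    nearest = sorted(samples, key=lambda s: math.dist(v, s[0]))[:k]
    return Counter(c for _, c in nearest).most_common(1)[0][0]

# With both features, the three nearest neighbours are all "X": wrong.
both = knn(test, train, 3)
# Keeping only the x feature, the majority of the 3 nearest are "O": correct.
x_only = knn((test[0],), [((p[0],), c) for p, c in train], 3)
print(both, x_only)  # X O
```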
30. Summary
    - Data clustering and data classification have been widely used in biological and medical data analysis.
    - Statistical analysis is probably the most important tool in the various data mining algorithms.