### I-2-ed.doc.doc

1. In the remaining part of the introduction, we take a look at the types of DM tasks, the subject of visualization, and the range of DM tools. We distinguish between two basic sorts of Data Mining tasks: predictive and descriptive. In predictive tasks we deal with classification problems, where we want to distinguish between members of different classes. We can do this by learning sets of rules or decision trees. We typically have one target variable, and this variable would usually be discrete, like reading or not reading, or being a reader of Delo or a reader of Večer, and so on.

2. We can have a binary classification problem or a multiclass classification problem, or we can deal with prediction and estimation tasks, for which we would typically use regression analysis. The data can be contained in a single table, or in a relational database with multiple relational tables, in which case we are dealing with so-called Relational Data Mining (RDM) or Inductive Logic Programming (ILP). In other sorts of data mining tasks, where we do not have a specified classification variable, we do not want to distinguish between members of different classes, but want to identify features of the data. These are descriptive DM tasks, like description and summarization, or dependency analysis, as in learning association rules. Descriptive DM also deals with discovering properties and constraints, segmenting or clustering, and discovering subgroups. So, in the simplest classification problem, we would distinguish between members of one class versus members of the other class, and that classification would be the ultimate goal. In descriptive DM, on the other hand, the data is not labeled with a class label and no specific variable is the target variable; we are just interested in finding properties which hold over parts of the data.
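A predictive (classification) task of this kind can be sketched in a few lines of Python. The magazine-readership data set, the attributes, and the hand-written rule are all invented for illustration; a real learner would induce the rule from the data rather than have it written by hand.

```python
# Toy classification task: a discrete target variable ("Delo" vs. "Vecer")
# and a hand-written rule set acting as the classifier.
# All attribute names and values are invented for illustration.

train = [
    ({"age": 45, "education": "university"}, "Delo"),
    ({"age": 50, "education": "university"}, "Delo"),
    ({"age": 23, "education": "secondary"}, "Vecer"),
    ({"age": 30, "education": "secondary"}, "Vecer"),
]

def classify(instance):
    """Rule set for the binary target variable Reads (Delo vs. Vecer)."""
    if instance["education"] == "university":
        return "Delo"
    return "Vecer"

# Evaluate the rule set on the (training) data.
accuracy = sum(classify(x) == y for x, y in train) / len(train)
print(accuracy)  # 1.0 on this toy sample
```

Extending the target variable to more than two values would turn this into a multiclass classification problem; replacing the discrete label with a numeric one would turn it into a regression (estimation) task.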
We would perhaps want to cluster together the individuals which are most similar to each other, which has to do with segmentation or clustering. We could also try to find properties which are characteristic of some part of the population, or we could do dependency analysis, such as association rule learning, as we have seen in some of the applications illustrated before.
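The segmentation step just described can be sketched as a single nearest-centroid assignment pass, a minimal piece of what an iterative clustering algorithm such as k-means would do. The ages and the starting centroids are invented:

```python
# Minimal segmentation sketch: assign each individual to the nearest of
# two cluster centroids. Values are invented for illustration.

ages = [18, 21, 19, 64, 70, 67]
centroids = [20.0, 65.0]  # assumed starting centroids for two segments

def nearest(value, centers):
    """Index of the centroid closest to the given value."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))

clusters = {0: [], 1: []}
for a in ages:
    clusters[nearest(a, centroids)].append(a)

print(clusters)  # {0: [18, 21, 19], 1: [64, 70, 67]}
```

Note that no class label is involved anywhere: the grouping emerges from similarity alone, which is what makes this a descriptive rather than predictive task.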
3. Depending on the sort of data, we can also deal with text and Web mining, graph or image analysis, and so on. Again, these tasks can be either predictive or descriptive, although the specific mechanisms for doing the data analysis would obviously be very much different. The difference between predictive and descriptive induction was already indicated: in predictive DM the main goal is to induce classifiers, whereas in descriptive induction the goal is to find interesting patterns in the data.
4. I have already shown this distinction between predictive and descriptive induction. In predictive induction, the main goal is to induce classifiers: we have classification and prediction tasks which we want to solve by inducing classifiers from the data. Besides classification rule learning and decision tree learning, we can also use Support Vector Machines (SVMs), one of the newer approaches to classification, extremely effective and yielding high-quality classifiers. We also have Artificial Neural Networks (ANNs), Bayesian classifiers, and so on. Generally, what we do here is generate a hypothesis from the data and then test it; the setting is similar in statistics, where you can do different sorts of data analysis. With descriptive induction, by contrast, we do not necessarily induce classifiers; the task can be discovering interesting regularities or patterns in the data. Here we could use symbolic clustering, association rule learning or subgroup discovery, which is more in the nature of exploratory data analysis, as done with statistical methods. From a rule learning perspective, in predictive induction, where we want to induce classifiers, we are not looking for individual rules, but are inducing sets of rules for every class.

5. So, we want to induce sets of rules acting as a classifier for solving a classification and prediction task. By contrast, in descriptive induction we are interested in finding individual rules which describe a certain interesting property or regularity in the data. Another distinction, both in terminology and in the types of tasks we are solving, is between supervised and unsupervised learning. Supervised learning occurs when there is a specified target variable representing a class, such as Diagnosis in a domain of patients: some patients have a certain diagnosis, different diagnoses correspond to different classes, and the values of the Diagnosis variable represent those classes. In that case we are dealing with a supervised learning problem, because the diagnoses have been made previously, so all the training instances are labeled with a class label, the value of the Diagnosis variable; it is the data that is supervised (labeled), not the learning itself. Such a setting is used for predictive induction. In unsupervised learning, on the other hand, there is no specified target variable representing the class and no class assignment; this is usually the setting used for descriptive induction.
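The generate-then-test setting mentioned above can be sketched with the most trivial hypothesis possible, a majority-class classifier, induced on one part of the labeled data and tested on the rest. The label list and the split point are invented:

```python
# Generate a hypothesis from training data, then test it on held-out data.
# The labels and the train/test split are invented for illustration.
from collections import Counter

labels = ["pos", "pos", "neg", "pos", "neg", "pos", "pos", "neg"]
train, test = labels[:6], labels[6:]

# Generate: the induced "classifier" simply predicts the majority
# class of the training set.
hypothesis = Counter(train).most_common(1)[0][0]

# Test: measure accuracy on the unseen instances.
accuracy = sum(y == hypothesis for y in test) / len(test)
print(hypothesis, accuracy)  # pos 0.5
```

Any of the classifiers mentioned above (rules, trees, SVMs, ANNs, Bayesian classifiers) slots into the "generate" step; the "test" step stays the same.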
6. We can also do descriptive induction on labeled data, inducing patterns and individual regularities from the data. This is done in subgroup discovery: we have labeled data, but are solving a kind of descriptive induction task, which will be treated later. First, let us have a look at visualizing the data.
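To make subgroup discovery on labeled data concrete, here is a sketch of one commonly used rule-quality measure, weighted relative accuracy (WRAcc), which rewards subgroups whose class distribution differs from that of the whole population. The patient counts are invented:

```python
# Weighted relative accuracy, a standard subgroup-quality measure.
# Counts below are invented for illustration.

def wracc(n_sub_pos, n_sub, n_pos, n):
    """WRAcc = P(subgroup) * (P(class | subgroup) - P(class))."""
    return (n_sub / n) * (n_sub_pos / n_sub - n_pos / n)

# 100 patients, 40 with the target diagnosis; a candidate subgroup
# covers 20 patients, 16 of whom have the diagnosis.
q = wracc(16, 20, 40, 100)
print(q)  # 0.2 * (0.8 - 0.4) = approx. 0.08
```

A subgroup whose internal class distribution matches the population's (e.g. 8 of 20 positive here) scores zero, which captures the idea that interesting subgroups are exactly those with a significantly different distribution.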
7. In this example, there are three variables describing hospital data about patients with hip implants: age, time of hospitalization, and the time from the injury until a hip implant was put into the patient's hip. These give the different data points, and different colors represent different classes. This is one possible way of visualizing the data, called a scatter plot. You can see that most of the data lies in one part of the graph, but there are some outliers. You can then look at those particular data points and try to understand whether they are indeed outliers, whether they represent noise (for example, a typing error in the description of a particular patient), or whether there is some regularity behind them. So visualization, as we have already shown before in the analysis of traffic accident data, is a useful means in the data preparation phase. You can also use other sorts of representation, such as frequencies shown in a different way; for instance, in a medical application, different attributes like sex, age, EP type and implantation time would be shown in different colors.
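The outlier inspection described above can also be sketched numerically rather than visually. The hospitalization times below are invented, and flagging points more than two standard deviations from the mean is just one common rule of thumb, not the method used in the lecture's example:

```python
# Flag outlier candidates: values far from the mean of the sample.
# The hospitalization times (in days) are invented for illustration.

times = [9, 11, 10, 12, 10, 11, 60]  # the last value looks suspicious

mean = sum(times) / len(times)
sd = (sum((t - mean) ** 2 for t in times) / len(times)) ** 0.5

outliers = [t for t in times if abs(t - mean) > 2 * sd]
print(outliers)  # [60]
```

Whether such a flagged point is a typing error or a genuine regularity still has to be decided by going back to the original record, exactly as described above for the scatter plot.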
8. With the different heights of these bars you would see the frequencies; for instance, you could see that the patients are mostly female, with very few male patients. You could also see some other regularities, for example that a certain value grows with age (from younger to older patients). You also see the connections between certain values of attributes, where strong connections represent high-frequency connections. With such a representation you can also induce some regularities.
9. You could also have a visualization of time series. This was done in a replication study for the hospital of Jesenice, when we were looking at the ineffectiveness of antibiotics in the case of in-house bacteria growth in the hospital. We could also have a visualization of induced rules, again based on frequencies. Suppose we have the distribution of healthy and non-healthy patients in the entire population. Once we have induced a rule describing a subset of the population, such a subgroup would cover part of the population, and the distribution of the two classes in the subgroup would be significantly different from the one in the entire population. So, in the subgroup discovery approach, we are trying to find subsets of individuals where the class distribution is significantly different in the subgroup compared to the initial distribution in the entire population.

10. Here is another way of visualizing subgroups induced by a subgroup discovery algorithm. If you have patients, some with coronary heart disease and some healthy, and a subgroup of patients described by certain values of attributes, you might see that people belonging to one subgroup are mostly older patients, whereas people belonging to another subgroup range over a larger scope of ages. We could also compare this against the initial population, where we would have approximately the same number of ill patients as healthy patients in a certain hospital.

11. This would be the visualization of association rules in a professional Data Mining tool called DB-Miner. The visualization is such that, for a certain attribute-value pair, there is an association between one item and another, and depending on the height of the bar you see whether a certain association has high confidence and high support. With MineSet, another Data Mining tool, they decided that the height of such a bar would represent support and the color of the bar would represent confidence. If you recall, confidence and support were the two measures for evaluating the quality of associations induced from the data.

12. Here is a representation of a decision tree. If you remember the decision tree which we had for the readers of the teenage magazine Antena: we had the root of the decision tree, and the set of readers and non-readers was split into subsets, and so on. Now, turn this tree around and go to the root: here is the root of the decision tree, with a certain distribution of the classes (class red and class green). Then you split the set of all individuals in the data set into two subsets, one with a certain value of the most informative attribute, which was Gleason Score, and the other with another value; you come to a different node with a different distribution of the two classes, and you split further. It is a kind of visualization, like flying an airplane over the landscape of possible pieces of knowledge.
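Support and confidence, the two rule-quality measures recalled above, can be computed directly from transaction counts. The tiny shopping-basket data set below is invented:

```python
# Support and confidence of an association rule X -> Y.
# The transactions are invented for illustration.

transactions = [
    {"coffee", "sugar"},
    {"coffee", "sugar", "milk"},
    {"coffee"},
    {"tea", "sugar"},
]

def support(itemset):
    """Fraction of transactions containing the whole itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Fraction of transactions with lhs that also contain rhs."""
    return support(lhs | rhs) / support(lhs)

# Rule: coffee -> sugar
print(support({"coffee", "sugar"}))       # 2/4 = 0.5
print(confidence({"coffee"}, {"sugar"}))  # 2/3, approx. 0.67
```

These are exactly the two quantities that DB-Miner and MineSet map onto bar height and color in the visualizations described above.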
13. There is a number of tools; an interesting website is www.kdnuggets.com, where you can find different sorts of software: Data Mining suites supporting multiple discovery tasks and data preparation, WEKA being one of them, which includes a number of Data Mining algorithms. Then there are specific tools and algorithms for classification, clustering, statistics, link and association rule learning, sequential Data Mining, visualization, text and Web Mining, and so on. One of the tools is Clementine; what is nice about it is that it supports visual programming: you start with the data, produce a table out of the data, then run a certain decision tree learning algorithm, and so on. There is also a program called Orange, developed at the Faculty of Computer and Information Science by its main authors Blaž Zupan and Janez Demšar; it is likewise a toolbox for Data Mining, with a nice user interface to its Data Mining algorithms. S-Plus you already know.

14. With this I will finish the introductory part. To wrap up: KDD, or Knowledge Discovery in Databases, is the term used to describe the overall process of discovering useful knowledge in data. It includes data preparation, data cleaning, transformation and pre-processing, Data Mining itself, and the evaluation of the results. Data Mining is just one step in the KDD process, a step which takes a relatively minor part of the overall effort but which involves all the technology we will be describing in our course, employing techniques from machine learning, statistics, Web Mining, and so on.

15. We have described two different sorts of tasks, predictive and descriptive, which have different goals. With predictive tasks, our goal is to produce the best possible classifier, whereas with descriptive tasks we are not aiming at producing a classification model, but at producing whatever other interesting patterns we can find in the data. There are numerous applications of Data Mining and many powerful tools available, and hopefully in this course you will find out more about Data Mining.