# I-2-ed.doc


In the remaining part of the introduction, we take a look at the types of DM tasks and at visualization, and we review the range of DM tools.

We will distinguish between two basic sorts of data mining tasks: predictive and descriptive. In predictive tasks we deal with classification problems, where we want to distinguish between members of different classes. We can do this by learning sets of rules or decision trees. We typically have one target variable, and this variable would usually be discrete, like reading or not reading, or being a reader of Delo or a reader of Večer, and so on. We can have a binary classification problem or a multiclass classification problem, or we can deal with prediction and estimation tasks, where we would typically use regression analysis. The data can be contained in a single table, in a relational database, or in multiple relational tables; in the latter case we are dealing with so-called predictive Relational Data Mining (RDM) or Inductive Logic Programming (ILP).

In other sorts of data mining tasks we do not have a specified classification variable, so we do not want to distinguish between members of different classes, but want to identify features of the data. These are descriptive DM tasks, like description and summarization, or dependency analysis, as in learning association rules. Descriptive DM also deals with discovering properties and constraints, segmenting or clustering, and discovering subgroups.

So, if we try to show it in a picture: in the simplest classification problem, we distinguish between members of one class and members of the other class, and that classification is the ultimate goal. We only try to distinguish between the populations of one class and the other. In descriptive DM, on the other hand, the data is not labeled with a class label, no specific variable is the class or target variable, and we are just interested in finding some properties that hold on parts of the data. We would perhaps want to cluster together the individuals that are most similar to each other, and that has to do with segmentation or clustering. We could also try to find properties that are characteristic for some part of the population, or we could do dependency analysis, such as association rule learning, as we have seen in some of the applications illustrated before.
Depending on the sort of data, we can also deal with text and Web mining, graph or image analysis, and so on. Again, these tasks can be either predictive or descriptive, although the specific mechanisms of data analysis are obviously very different. So, I have already shown this distinction between predictive and descriptive. In predictive induction the main goal is to induce classifiers: we have a classification or prediction task, which we want to solve by inducing classifiers from the data.
The approaches here include classification rule learning and decision tree learning. We also have support vector machines, one of the newer approaches to classification, with which extremely effective, high-quality classifiers can be obtained. Then we have artificial neural networks, Bayesian classifiers, and so on. So, here we generate a hypothesis out of the data, and then we test it. It is similar in statistics, where you can do different sorts of data analysis in such a setting.

With descriptive induction the task can be different: discovering interesting regularities in the data, not necessarily inducing a classifier. For discovering patterns we could use symbolic clustering, association rule learning, or subgroup discovery, and that would be more in the spirit of exploratory data analysis, as is done with statistical methods.

If we take a rule learning perspective: in predictive induction, when we want to induce classifiers, we are not looking at individual rules, but we are inducing sets of rules for every class. So, we want to induce sets of rules acting as a classifier for solving a classification or prediction task. In descriptive induction, by contrast, we are interested in finding individual rules, each of which describes a certain interesting property or regularity in the data.
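To make the rule learning perspective concrete, here is a minimal sketch in Python. The records, attribute names and rules are all invented for illustration and do not come from the lecture data. It only shows the difference in how the rules are used: a rule set is judged as a whole classifier, while a descriptive rule is judged on its own.

```python
# Hypothetical toy records; attribute names and values are made up.
DATA = [
    {"age": 14, "sex": "f", "reader": "yes"},
    {"age": 16, "sex": "m", "reader": "yes"},
    {"age": 17, "sex": "f", "reader": "yes"},
    {"age": 25, "sex": "f", "reader": "no"},
    {"age": 34, "sex": "m", "reader": "no"},
    {"age": 41, "sex": "m", "reader": "no"},
]

# Predictive induction: an ordered SET of rules plus a default class.
# Together they assign a class to every instance.
RULE_SET = [
    (lambda r: r["age"] < 18, "yes"),
    (lambda r: r["age"] >= 18, "no"),
]
DEFAULT_CLASS = "no"


def classify(record):
    """Return the class of the first rule whose condition fires."""
    for condition, label in RULE_SET:
        if condition(record):
            return label
    return DEFAULT_CLASS


def accuracy(data):
    """Share of records that the whole rule set classifies correctly."""
    correct = sum(classify(r) == r["reader"] for r in data)
    return correct / len(data)


# Descriptive induction: ONE rule, evaluated on its own by how many
# records it covers and how many of the covered records match the target.
def describe(condition, target, data):
    covered = [r for r in data if condition(r)]
    hits = sum(r["reader"] == target for r in covered)
    precision = hits / len(covered) if covered else 0.0
    return len(covered), precision


print("rule set accuracy:", accuracy(DATA))
print("single rule (covered, precision):",
      describe(lambda r: r["sex"] == "f" and r["age"] < 20, "yes", DATA))
```

The single rule covers only part of the data and says nothing about the rest, which is acceptable for a descriptive task but would not be enough for a classifier.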
Another distinction, both in terminology and in the types of tasks we are solving, is between supervised and unsupervised learning. Supervised learning occurs when there is a specific target variable representing a class. For example, you have patients, and some patients have a certain diagnosis; you would then have a variable like diagnosis, and the values of that variable would represent the different classes. In that case we say we are dealing with a supervised learning problem: all the training instances are labeled with a class label. Such a setting is used for predictive induction. In the unsupervised learning setting we do not have a specific target variable representing the class, so there is no class assignment, and this setting is usually used for descriptive induction. We can also do descriptive induction on labeled data, inducing some individual regularities from the data, but this is the most common distinction. In subgroup discovery we have labeled data but are doing a kind of descriptive induction task. We will talk about that later. So I will skip this and show some visualization.
So let us look at the visualization of the data. For instance, here you have three variables with which the data is described: age, time of hospitalization, and the time from the past injury … there was a hip implant put into the patient's hip. These are the different data points, and different colors represent different classes. This is one possible way of visualizing the data, called a scatter plot. You can see that most of the data is here, but then you have some outliers. You can look at these particular data points and try to understand whether a point is indeed an outlier, whether it represents noise, whether there was some typing error in the description of that particular data point, or whether there is some regularity behind it. So visualization, as we have already shown in the traffic accident data analysis, is a useful means in the data preparation phase.

You can also use other sorts of representation, like frequencies, but done in a different way. For instance, again for a medical application, different attributes would be shown with different colors, like sex, age, EP type, implantation time, and so on. The heights of these bars then show the frequencies; for instance, you could see that the patients are mostly female, with very few male patients.
Then you could also see some other regularities here, for example that a certain value grows with age (these are young patients, these are older patients). And then you have the connections between certain values of attributes, where the strong connections represent high-frequency connections. So, with such a representation you can also induce some regularities. You could have a visualization of time series. This was done in an application for the Jesenice hospital, when we were looking at the ineffectiveness of antibiotics in the case of in-house bacterial growth in the hospital.
We could have a visualization of induced rules, again based on frequencies. This would be, let us say, the distribution of healthy and non-healthy patients in the entire population. Then, once we have induced a rule describing a subset of the population, a subgroup, such a subgroup would cover part of the population, and the distribution of the two classes in the subgroup would be significantly different from that in the entire population. So, in the subgroup discovery approach, we are trying to find subsets of individuals where the class distribution in the subgroup is significantly different from the initial distribution in the entire population. And this is another way of visualizing subgroups induced by a subgroup discovery algorithm.
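The comparison between the class distribution in a subgroup and in the whole population can be sketched in a few lines of Python. The patient records below are invented for illustration. The quality score used here, weighted relative accuracy (WRAcc), is one common choice in subgroup discovery and is an assumption on my part; the lecture does not name a specific measure, and the sketch does not perform a statistical significance test.

```python
# Invented toy population: (age, smoker, diagnosis). Not real data.
PATIENTS = [
    (72, True, "ill"), (68, True, "ill"), (75, False, "ill"), (59, True, "ill"),
    (45, False, "healthy"), (38, False, "healthy"),
    (66, False, "healthy"), (51, True, "healthy"),
]


def class_share(rows, target="ill"):
    """Fraction of rows that belong to the target class."""
    return sum(1 for r in rows if r[2] == target) / len(rows)


def subgroup(rows, condition):
    """Rows covered by the rule condition."""
    return [r for r in rows if condition(r)]


def wracc(rows, condition, target="ill"):
    """Weighted relative accuracy: coverage * (share in subgroup - share overall).

    Assumed quality measure; positive values mean the target class is
    over-represented in the subgroup compared to the whole population.
    """
    sub = subgroup(rows, condition)
    if not sub:
        return 0.0
    coverage = len(sub) / len(rows)
    return coverage * (class_share(sub, target) - class_share(rows, target))


def older(row):
    """Hypothetical rule condition: age above 60."""
    return row[0] > 60


print("ill share, whole population:", class_share(PATIENTS))
print("ill share, subgroup age > 60:", class_share(subgroup(PATIENTS, older)))
print("WRAcc of the subgroup:", wracc(PATIENTS, older))
```

A bar chart of the two shares side by side would correspond to the kind of frequency-based visualization described above.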
This would be another way of presenting subgroups. If you have, let us say, some patients with coronary heart disease and some others who are healthy, and this is a subgroup of patients described with certain values of attributes, with certain properties, then you would see that the people belonging to this subgroup are mostly older patients, whereas people belonging to this other subgroup range over a wider span of ages. We could also compare this with the initial population, where we would have approximately the same number of ill patients as healthy patients in a certain hospital.
This would be the visualization of association rules with a professional data mining software package called DB-Miner. The visualization works like this: for a certain attribute-value pair, there is an association between this and this, and then from the height of the bar you see whether a certain association has high ____ and high support. With MineSet, which is another data mining tool, they decided that the height of such a bar would represent support and the color of the bar would represent confidence. If you recall, confidence and support are the two measures for evaluating the quality of the associations induced from the data.
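As a reminder of how the two measures are computed, here is a small Python sketch for a rule of the form antecedent → consequent. The transactions are invented for illustration; the definitions are the standard ones (support is the fraction of transactions containing all items of the rule, confidence is the fraction of transactions containing the antecedent that also contain the consequent).

```python
# Invented toy transactions, one set of items per transaction.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "jam"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(antecedent, consequent, transactions):
    """Fraction of all transactions that contain antecedent and consequent."""
    return count(antecedent | consequent, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    covered = count(antecedent, transactions)
    if covered == 0:
        return 0.0
    return count(antecedent | consequent, transactions) / covered


rule = ({"bread"}, {"milk"})
print("support:", support(*rule, TRANSACTIONS))
print("confidence:", confidence(*rule, TRANSACTIONS))
```

In a MineSet-style display as described above, the first number would set the height of the bar and the second its color.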
Here is a representation of a decision tree. You may remember the decision tree we had for the readers of the teenage magazine Antena. We had the root of the decision tree, and then the set of readers and non-readers was split into subsets, and so on. Now, if you turn this tree around and go to the root, here is the root of the decision tree, with a certain distribution of the classes (class red and class green). Then you split the set of all individuals in the data set into two subsets: one with a certain value of the most informative attribute, which was Gleason Score, and the other with another value. Then you come to a different node, let us say like this one, where there is a different distribution of the two classes, and then you split further. So it is a kind of visualization like flying an airplane over the landscape of possible pieces of knowledge.

There are a number of tools. An interesting website is www.kdnuggets.com, where you can find different sorts of software. There are data mining suites supporting multiple discovery tasks and data preparation, WEKA being one of them, which includes a number of data mining algorithms. And then there are specific tools and algorithms for classification, clustering, statistics, links and association rule learning, sequential data mining, visualization, text and Web mining, and so on.
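One common way to decide which attribute is the most informative at a node is information gain based on entropy. That choice is an assumption here, since the lecture does not say which measure was used for the tree shown. The following Python sketch uses invented rows with hypothetical attribute names (`score`, `sex`) and the two classes red and green.

```python
from collections import Counter
from math import log2

# Invented rows; "score" and "sex" are hypothetical attribute names.
ROWS = [
    {"score": "high", "sex": "m", "cls": "red"},
    {"score": "high", "sex": "f", "cls": "red"},
    {"score": "low", "sex": "m", "cls": "green"},
    {"score": "low", "sex": "f", "cls": "green"},
]


def entropy(rows):
    """Entropy (in bits) of the class distribution in the given rows."""
    counts = Counter(r["cls"] for r in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())


def information_gain(rows, attribute):
    """Entropy at the node minus the weighted entropy after the split."""
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


def most_informative(rows, attributes):
    """Attribute whose split yields the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a))


for name in ("score", "sex"):
    print(name, information_gain(ROWS, name))
print("split on:", most_informative(ROWS, ["score", "sex"]))
```

Splitting on the chosen attribute and repeating the same computation in each resulting subset gives the node-by-node change in class distribution that the tree visualization shows.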
One of the tools is Clementine. What is nice here is that it supports visual programming: you start with the data, you produce a table out of this data, then you run a certain decision tree learning algorithm, and so on. There is also a program called Orange, which was developed at the Faculty of Informatics and Computer Sciences; the main authors are Blaž Zupan and Janez Demšar. It is also a toolbox for data mining, with a nice user interface to its data mining algorithms. S-Plus you already know.
With this I would finish the introductory part and wrap up. KDD, or Knowledge Discovery in Databases, is the term used to describe the overall process of discovering useful knowledge in data. It includes data preparation, data cleaning, transformation, and pre-processing, then data mining, and then the evaluation of the results. Data mining is just one step in the KDD process; it takes a relatively minor part of the effort, but it involves all the technology that we will be describing in this course, employing techniques from machine learning, statistics, Web mining, and so on.

We have described two different sorts of tasks, predictive and descriptive, which have different goals. With predictive tasks our goal is to produce the best possible classifier, whereas with descriptive tasks we are not aiming at producing a classification model, but at finding whatever other interesting patterns there are in the data. There are numerous applications of data mining and many powerful tools available, and hopefully in this course you will find out more about data mining.