In the remaining part of the introduction, we take a look at the types of DM tasks
and the subject of visualization, and review the range of DM tools.
We will distinguish between two basic sorts of Data Mining tasks: predictive and
descriptive. In predictive tasks we deal with classification problems, where we
want to distinguish between members of different classes. We can do this by
learning sets of rules or decision trees. We typically have one target variable, and
this variable would usually be discrete, like reading or not reading a newspaper,
or being a reader of Delo or a reader of Večer, and so on. We can have a binary
classification problem or a multiclass classification problem, or we can deal with
prediction and estimation tasks, where the target is numeric, and there we would
typically use regression analysis. The data can be contained in a single table or
in multiple tables of a relational database; in the latter case we are dealing with
so-called predictive Relational Data Mining (RDM) or Inductive Logic
Programming (ILP).
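As a rough illustration of the two kinds of predictive task, here is a minimal
Python sketch. The data, the threshold rule and the function names are invented
for this example and do not come from any of the applications mentioned; the point
is only that a classifier returns a discrete class, while regression returns a
number.

```python
# Classification: the target is discrete (here: reader / non-reader).
def classify_reader(age, threshold=18):
    """Hypothetical one-condition rule: young people read the magazine."""
    return "reader" if age < threshold else "non-reader"


# Estimation: the target is numeric, so we fit a line by least squares.
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up numeric data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
```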
In other sorts of data mining tasks we do not have a specified classification
variable, so we do not want to distinguish between members of different classes,
but want to identify features of the data. These are descriptive DM tasks, like
description and summarization, or dependency analysis, as in learning association
rules. Descriptive DM also deals with discovering properties and constraints,
segmenting or clustering, and discovering subgroups. So, if we try to show it on a
picture, in the simplest classification problem we would distinguish between
members of one class versus members of the other class, and that classification
would be the ultimate goal. So, we would only try to distinguish between the
populations of one class and the other. On the other hand, in descriptive DM the
data is not labeled with a class label, no specific variable is the class or the
target variable, and we are just interested in finding some properties that hold on
parts of the data. We would perhaps want to cluster together the individuals that
are most similar to each other, and that has to do with segmentation or clustering.
We can also try to find properties that are characteristic for some part of the
population, or we can do dependency analysis, such as association rule learning,
as we have seen in some of the applications.
Depending on the sort of data, we can also do text and Web mining, graph or image
analysis, and so on. Again, these tasks can be either predictive or descriptive,
but the specific mechanisms of data analysis are obviously very much different.
So, I have already shown this distinction between predictive and descriptive. In
predictive induction the main goal is to induce classifiers: we have classification
and prediction tasks, which we want to solve by inducing classifiers from the data.
Here we have classification rule learning, decision tree learning, and also Support
vector machines, which is one of the newer approaches to classification; extremely
effective and high quality classifiers can be obtained by Support vector machines.
Then we have artificial neural networks, Bayesian classifiers and so on. So, here
we generate a hypothesis out of the data, and then we test it. It is similar in
statistics, where you can do different sorts of data analysis in such a setting.
With descriptive induction the task is different: discovering interesting
regularities in the data, not necessarily inducing a classifier. For discovering
patterns we could use symbolic clustering, association rule learning and subgroup
discovery, and that would be more in the spirit of exploratory data analysis, as is
done with statistical methods.
If we take a rule learning perspective, in predictive induction, when we want to
induce classifiers, we are not looking at individual rules, but we are inducing
sets of rules for every class. So, we want to induce sets of rules acting as a
classifier for solving a classification or prediction task. In descriptive
induction, on the other hand, we are interested in finding individual rules, each
of which describes a certain interesting property or regularity in the data.
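The difference between a rule set acting as a classifier and a single descriptive
rule can be sketched in a few lines of Python. Everything here is assumed for
illustration: the records, the two rules, the default class and the helper names
are made up, and the rules are written by hand rather than induced by a learning
algorithm.

```python
# Made-up records in the spirit of the magazine-reader example.
records = [
    {"age": 16, "reads": True},
    {"age": 17, "reads": True},
    {"age": 15, "reads": True},
    {"age": 34, "reads": False},
    {"age": 45, "reads": False},
    {"age": 19, "reads": False},
]

# Predictive view: an ordered set of rules plus a default class is a classifier.
rule_set = [
    (lambda r: r["age"] < 18, True),
    (lambda r: r["age"] >= 30, False),
]
DEFAULT_CLASS = False


def classify(record):
    """Return the class of the first rule that fires, else the default class."""
    for condition, label in rule_set:
        if condition(record):
            return label
    return DEFAULT_CLASS


def accuracy(data):
    """Share of records for which the whole rule set predicts the right class."""
    return sum(classify(r) == r["reads"] for r in data) / len(data)


# Descriptive view: one rule is judged on its own, by how much of the data it
# covers and how pure the covered part is.
def describe(condition, target, data):
    covered = [r for r in data if condition(r)]
    if not covered:
        return 0.0, 0.0
    coverage = len(covered) / len(data)
    precision = sum(r["reads"] == target for r in covered) / len(covered)
    return coverage, precision
```

The classifier is evaluated as a whole with accuracy(records), whereas
describe(lambda r: r["age"] < 18, True, records) evaluates one rule in isolation.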
Another distinction, both in terminology and in the types of tasks we are solving,
is between supervised and unsupervised learning. Supervised learning occurs when
there is a specific target variable representing a class. For example, you have
patients, and some patients have a certain diagnosis, so you would have a variable,
like diagnosis, and the values of that variable would represent the different
classes. In that case we say we are dealing with a supervised learning problem, and
all the training instances are labeled with a class label. Such a setting is used
for predictive induction. In the unsupervised learning setting we do not have a
specific target variable representing the class, so there is no class assignment,
and this is usually used for descriptive induction. We can obviously do descriptive
induction also on labeled data, inducing some individual regularities from the
data, but that would be the most common distinction. In subgroup discovery we have
labeled data, but we are doing a kind of descriptive induction task. We will talk
about that later.
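A small Python sketch may help to see the two settings side by side. It is only an
illustration under stated assumptions: the single numeric attribute, its values and
the diagnosis labels are invented, a nearest-centroid classifier stands in for
"some supervised method", and a simple two-means procedure stands in for "some
unsupervised method".

```python
from statistics import mean

# Made-up data: one numeric attribute per patient and a diagnosis label.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
diagnosis = ["healthy", "healthy", "healthy", "ill", "ill", "ill"]


def fit_centroids(xs, labels):
    """Supervised: the class labels are used to compute one centroid per class."""
    return {
        c: mean(x for x, label in zip(xs, labels) if label == c)
        for c in set(labels)
    }


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda c: abs(centroids[c] - x))


def two_means(xs, iterations=10):
    """Unsupervised: split the values into two clusters without using labels."""
    lo, hi = min(xs), max(xs)  # initial cluster centres
    left, right = [], []
    for _ in range(iterations):
        left = [x for x in xs if abs(x - lo) <= abs(x - hi)]
        right = [x for x in xs if abs(x - lo) > abs(x - hi)]
        lo, hi = mean(left), mean(right)
    return left, right


centroids = fit_centroids(values, diagnosis)
left, right = two_means(values)
```

Note that two_means never sees the diagnosis list; it only groups similar values,
and it assumes the values are not all identical.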
So I will skip this and show some visualizations.
So let us look at the visualization of the data. For instance here, the data is
described with three variables: age, time of hospitalization and the time from the
past injury … there was a hip implant put into the patient’s hip. These are the
different data points, and different colors represent different classes. So this is
one possible way of visualizing the data, called a scatter plot. You can see that
most of the data is here, but then you have some outliers, and you can look at
these particular data points and try to understand whether such a point is indeed
an outlier, whether it represents noise, whether there was some typing error in the
description of that particular data point, or whether there is some regularity
behind it. So visualization, as we have already shown in the traffic accident data
analysis, is a useful means in data analysis.
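In the lecture the outliers are spotted by eye on the scatter plot. As a
complement, here is a minimal Python sketch that flags candidate points for
inspection with a simple numeric rule (distance from the median, measured in units
of the median absolute deviation). The rule, the threshold of 3.5 and the patient
tuples are assumptions made for this example, not the method or the data used in
the application.

```python
from statistics import median


def flag_outliers(points, threshold=3.5):
    """Return indices of points that lie far from the median in any coordinate."""
    flagged = set()
    for d in range(len(points[0])):
        column = [p[d] for p in points]
        med = median(column)
        mad = median(abs(v - med) for v in column)  # median absolute deviation
        if mad == 0:
            continue  # all values (nearly) identical, nothing to flag here
        for i, v in enumerate(column):
            if abs(v - med) / mad > threshold:
                flagged.add(i)
    return sorted(flagged)


# Made-up tuples: (age, days of hospitalization, years since the injury).
patients = [
    (62, 10, 2),
    (65, 12, 3),
    (70, 11, 2),
    (68, 9, 4),
    (66, 95, 3),  # suspiciously long hospitalization, possibly a typing error
    (64, 10, 2),
]
suspicious = flag_outliers(patients)
```

A flagged point is only a candidate; as said above, you still have to look at it
and decide whether it is noise, an error, or a real regularity.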
You can also use other sorts of representation, such as frequencies shown in a
different way. For instance, again for a medical application, different attributes
would be shown with different colors, like sex, age, EP type, implantation time and
so on.
The different heights of these bars show the frequencies; for instance, you could
see that the patients are mostly female and very few are male.
Then you could also see some other regularities here, for instance that with age
(these are young patients, these are older patients) a certain value is growing.
And then you have the connections between certain values of attributes, where the
strong connections represent the high frequency connections. So, with such a
representation you can also induce some regularities.
You could have a visualization of time series. This was done in an application for
the hospital of Jesenice, when we were looking at the ineffectiveness of
antibiotics in the case of the growth of in-house bacteria in the hospital.
We could have a visualization of induced rules, again based on frequencies. This
would be, let us say, the distribution of healthy and non-healthy patients in the
entire population. Once we have induced a rule describing a subset of the
population, a subgroup, such a subgroup would cover part of the population, and the
distribution of the two classes in the subgroup would be significantly different
from that in the entire population. So, in the subgroup discovery approach, we are
trying to find subsets of individuals where the class distribution is significantly
different compared to the initial distribution in the entire population. And this
would be the visualization in that case; it is another way of visualizing subgroups
induced by a subgroup discovery algorithm.
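The idea of comparing the class distribution in a subgroup with the distribution
in the whole population can be sketched in Python as follows. The patient records
and the subgroup condition are made up. The quality measure used here, weighted
relative accuracy (coverage times the difference between the class share in the
subgroup and in the population), is one measure commonly used in subgroup
discovery; the lecture does not say which measure the algorithm uses, and this
measure is not a statistical significance test.

```python
def class_share(data, target):
    """Share of records in data that belong to the target class."""
    return sum(1 for r in data if r["class"] == target) / len(data)


def wracc(data, condition, target):
    """Weighted relative accuracy of the subgroup defined by condition."""
    covered = [r for r in data if condition(r)]
    if not covered:
        return 0.0
    coverage = len(covered) / len(data)
    return coverage * (class_share(covered, target) - class_share(data, target))


# Made-up population: 5 ill and 5 healthy patients.
patients = [
    {"age": 72, "class": "ill"},
    {"age": 68, "class": "ill"},
    {"age": 65, "class": "ill"},
    {"age": 63, "class": "healthy"},
    {"age": 55, "class": "ill"},
    {"age": 50, "class": "healthy"},
    {"age": 45, "class": "ill"},
    {"age": 40, "class": "healthy"},
    {"age": 35, "class": "healthy"},
    {"age": 30, "class": "healthy"},
]


def is_older(record):
    """Hypothetical subgroup description: age over 60."""
    return record["age"] > 60


population_share = class_share(patients, "ill")
subgroup_share = class_share([r for r in patients if is_older(r)], "ill")
quality = wracc(patients, is_older, "ill")
```

In this toy population half of the patients are ill, while in the subgroup of
patients over 60 three out of four are ill, so the subgroup gets a positive score.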
This would be another way of finding subgroups. Let us say you have patients, some
with coronary heart disease and others who are healthy. If this is a subgroup of
patients described with certain values of attributes, with certain properties, then
you would see that, let us say, the people belonging to this subgroup are mostly
older patients, whereas the people belonging to this other subgroup range over a
wider scope of ages. We could also compare this with the initial population, where
we would have approximately the same number of ill patients as healthy patients in
the particular population in a certain hospital.
This would be the visualization of association rules with a professional Data
Mining software package called DB-Miner. The visualization works like this: for a
certain attribute-value pair, there is an association between this and this, and
then depending on the height of the bar you see whether a certain association has
high ____ and high support.
So for instance, in MineSet, which is another Data Mining tool, they have decided
that the height of such a bar would represent support and the color of the bar
would represent confidence. If you recall, confidence and support are the two
measures for evaluating the quality of the induced associations.
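As a reminder of what these two measures compute, here is a minimal Python sketch.
The transactions and the item names are invented for the example; support is taken
as the share of transactions containing all items of the rule, and confidence as
the support of the whole rule divided by the support of its left-hand side.

```python
def support(transactions, itemset):
    """Share of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    body = support(transactions, antecedent)
    if body == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / body


# Made-up transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
```

For the rule bread -> milk, two of the four transactions contain both items, and
two of the three transactions containing bread also contain milk.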
Here is a representation of a decision tree. You may remember the decision tree
which we had for the readers of the teenage magazine Antena: we had the root of the
decision tree, and then the set of readers and non-readers was split into subsets
and so on. Now, if you turn this tree around and go to the root, here is the root
of the decision tree, with a certain distribution of the classes, class red and
class green. Then you split the set of all individuals in the data set into two
subsets: one with a certain value of the most informative attribute, which was
Gleason Score, and the other with another value. Then you come to a different node,
let us say like this one, where there is a different distribution of the two
classes, and then you split further. So it is a kind of visualization like flying
an airplane over the landscape of possible pieces of knowledge.
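How the "most informative attribute" for a split might be chosen can be sketched in
Python. This sketch assumes information gain (reduction of class entropy) as the
measure of informativeness, which is a common choice but is not stated in the
lecture; the rows, attribute names and values are made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of the class distribution in labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus its weighted entropy after the split."""
    labels = [r[target] for r in rows]
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


def most_informative(rows, attributes, target):
    """Attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Made-up rows: "score" separates the classes perfectly, "sex" not at all.
rows = [
    {"score": "high", "sex": "m", "class": "red"},
    {"score": "high", "sex": "f", "class": "red"},
    {"score": "low", "sex": "m", "class": "green"},
    {"score": "low", "sex": "f", "class": "green"},
]
best = most_informative(rows, ["score", "sex"], "class")
```

At the root the two classes are equally frequent; after splitting on the chosen
attribute each subset has a different class distribution, which is what the nodes
of the visualized tree show.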
There is a number of tools; an interesting website is www.kdnuggets.com, where you
can find different sorts of software: some Data Mining suites supporting multiple
discovery tasks and data preparation, WEKA being one of them, which includes a
number of Data Mining algorithms. Then there are specific tools and algorithms for
classification, clustering, statistics, links and association rule learning,
sequential Data Mining, visualization, text and Web Mining and so on.
One of the tools is also Clementine; what is nice here is that it supports visual
programming. So you start with the data, you produce a table out of this data, then
you run a certain decision tree learning algorithm and so on. There is also a
program called Orange, which was developed at the Faculty of Computer and
Information Science; the main authors are Blaž Zupan and Janez Demšar. It is also a
toolbox for Data Mining, and it also has a nice user interface to its Data Mining
algorithms.
S-Plus you know.
So with this I would finish the introductory part, which I would wrap up by saying
that KDD, or Knowledge Discovery in Databases, is the term used to describe the
overall process of discovering useful knowledge in the data. It includes data
preparation, data cleaning, transformation and pre-processing, Data Mining, as well
as the evaluation of the results. Data Mining is just one step in the KDD process,
which takes a relatively minor part of the effort, but which involves all the
technology that we will be describing in this course, employing techniques from
machine learning and statistics, Web Mining and so on. We have described two
different sorts of tasks, predictive and descriptive, which have different goals.
With predictive tasks our goal is to produce the best possible classifier, whereas
with descriptive tasks we are not aiming at producing a classification model, but
at finding whatever other interesting patterns there are in the data. There are
numerous applications of Data Mining and many powerful tools available, and
hopefully in this course you will find out more about Data Mining.