This document discusses applying data mining techniques to predict whether a football match will be cancelled due to weather. Attributes that could be used in the prediction include amount of rain, temperature, humidity, and weather conditions. The data would come from weather stations and the football club. A second example discusses using attributes like player numbers, injuries, past goals, and team performance to predict the outcome of a match between Ajax and Real Madrid. The document also explains the difference between a training set, which is used to learn how much weight each attribute should get, and a test set, which is used to evaluate the resulting predictions. Key data mining concepts like features, instances, and classes are briefly defined with examples.
Assignment 1
Exercise 1: Data Mining in General
Describe in half a page to one page two scenarios to which you think one could apply data
mining. Preferably these two scenarios should be relevant to your professional or personal
interests. Describe what you would like to predict with data mining methods and what the
relevant attributes in these applications are. Describe also what type of data you would use
and what kind of problems you could anticipate.
What is the chance that next Sunday's football match at 14:30 will be cancelled?
In my competition, football matches have often been cancelled because of bad weather. Unfortunately, it
always takes a while before we are notified of a cancellation. It would be great to be able to predict
whether a match will go ahead, so that we can plan our Sunday in advance, with or without the
team.
To predict whether the match will go ahead, we should take several attributes into account on which we
base our final decision. Attributes we could take into account are:
• Amount of rain (mm): we would like to know how much rain has fallen over a certain
period (hours/days/weeks).
• Temperature (degrees): the temperature should definitely be taken into account.
• Humidity: we can use a numerical value for the amount of moisture (grams) within a cubic metre of air.
• Weather conditions: this attribute could consist of discrete values such as sunny, overcast and
rainy.
• Has scattered sand: this is a condition that tells us whether the groundsman
of this football field has used sand to drain the water from the field.
• Has artificial grass: if we play on artificial grass, we need to worry less about the weather
conditions.
• Will another match be played at the same time (Yes/No)? If there is only one artificial
grass field, another team could get priority.
The data could come from several weather stations and from the football club where we will be playing the
match. The main problem to address is to find the moments at which we can't play football: what
are the weather conditions when matches are cancelled, compared to the ones when we do play?
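The attributes listed above could be recorded as one record per past match day. The following is a minimal sketch, assuming hypothetical field names and made-up values; it only shows the shape of the data and a first comparison of cancellations per weather condition, not a real prediction method.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MatchDay:
    """One past match day; all field names are illustrative assumptions."""
    rain_mm: float               # rain over the preceding 24 hours
    temperature_c: float
    humidity_g_m3: float         # moisture per cubic metre of air
    weather: str                 # "sunny", "overcast" or "rainy"
    has_scattered_sand: bool
    has_artificial_grass: bool
    other_match_same_time: bool
    cancelled: bool              # the value we want to predict


# Made-up records, only to show what the collected data could look like.
history = [
    MatchDay(0.0, 18.0, 9.0, "sunny", False, False, False, False),
    MatchDay(22.5, 6.0, 7.0, "rainy", False, False, False, True),
    MatchDay(15.0, 8.0, 8.0, "rainy", True, True, False, False),
    MatchDay(3.0, 11.0, 8.5, "overcast", False, False, True, False),
    MatchDay(30.0, 4.0, 6.0, "rainy", False, False, True, True),
]


def cancellation_rate_by_weather(records):
    """Return the fraction of cancelled matches for each weather condition."""
    totals = defaultdict(int)
    cancelled = defaultdict(int)
    for record in records:
        totals[record.weather] += 1
        cancelled[record.weather] += int(record.cancelled)
    return {weather: cancelled[weather] / totals[weather] for weather in totals}


rates = cancellation_rate_by_weather(history)
```

With real data from the club and the weather stations, the same comparison could be repeated for the other attributes to see which conditions go together with cancellations.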
What is the chance that football club Ajax wins its next match against Real Madrid?
Next Tuesday Ajax plays against Real Madrid and we would like to predict the outcome of this match.
Instead of hoping for an octopus that shows us the outcome of every match, we would like to make the
prediction ourselves. There are several attributes we can take into account:
• Number of players
• Number of injuries in team
• Number of goals during previous matches
• Outcomes of the clubs within their national competition
• Difference in value of players (how much is this player worth?)
The main problem in the example above is that football involves so many variables that
it is really hard to pin down which ones limit the quality of the prediction. Referees, for example,
can also influence the outcome of a match.
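One possible way to turn the attributes above into input for a prediction is to describe the match by the difference between the two teams on each attribute. The sketch below assumes hypothetical field names and placeholder numbers; they are not real statistics for either club.

```python
from dataclasses import dataclass, fields


@dataclass
class TeamStats:
    """Pre-match figures for one team; field names are illustrative assumptions."""
    available_players: int
    injured_players: int
    goals_previous_matches: int
    league_points: int
    squad_value_million_eur: float


def difference_features(team_a: TeamStats, team_b: TeamStats) -> dict:
    """Describe a match by team_a minus team_b, one value per attribute."""
    return {
        f.name: getattr(team_a, f.name) - getattr(team_b, f.name)
        for f in fields(team_a)
    }


# Placeholder numbers, chosen only to make the example run.
ajax = TeamStats(22, 3, 9, 30, 200.0)
real_madrid = TeamStats(24, 1, 12, 33, 700.0)

features = difference_features(ajax, real_madrid)
```

A positive value means the first team scores higher on that attribute. Influences such as the referee are not captured by any of these values, which is exactly the difficulty described above.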
Exercise 2: Training and Test Data
Describe the difference between a training set and a test set. What would happen if we do
not make that distinction and combine all available data into one single set?
Training set:
This set contains past examples for which the outcome is already known. The data mining method uses it
to learn how much weight each attribute should get. If we want to predict whether we will play football
tomorrow, it may learn, for instance, that on a sunny day it no longer matters whether we have an
artificial grass field or not.
Test set:
A test set is used to check whether the model learned from the training set does what it needs to do:
does it predict yes/no correctly for examples it has not seen before?
The main difference between the two is that the training set is used to set the weights and the test set
is only used to check them. If we did not make a distinction between the two, we would evaluate the
model on the same data it learned from. The result would look better than it really is, because we would
not have checked the model against any independently measured values. For example, the weights could be
set wrongly without us noticing, and the model could predict that we can't play football on
sunny days because only one artificial field is available next to 7 regular football fields.
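The difference between the two sets can be made concrete with a small sketch. It assumes a made-up history of (rain, cancelled) pairs and a deliberately simple rule, "cancel when the rain is at or above a threshold", in place of a real data mining method; the function names are hypothetical.

```python
import random

# Hypothetical labelled history: (rain in mm over the last 24 h, cancelled?).
# The numbers are made up for illustration.
data = [
    (0.0, False), (1.5, False), (3.0, False), (4.0, False), (6.0, False),
    (8.0, True), (12.0, False), (15.0, True), (18.0, True), (22.0, True),
    (25.0, True), (30.0, True),
]


def split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and cut it into a training and a test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(threshold, records):
    """Fraction of records where 'rain >= threshold' matches the true label."""
    hits = sum((rain >= threshold) == cancelled for rain, cancelled in records)
    return hits / len(records)


def fit_threshold(train):
    """Pick the rain threshold that classifies the training set best."""
    candidates = [rain for rain, _ in train] + [float("inf")]
    return max(candidates, key=lambda t: accuracy(t, train))


train_set, test_set = split(data)
threshold = fit_threshold(train_set)

train_accuracy = accuracy(threshold, train_set)  # measured on data already seen
test_accuracy = accuracy(threshold, test_set)    # measured on data never seen
```

The threshold is chosen using the training set only. Reporting `train_accuracy` alone would correspond to combining everything into one set: the rule is graded on the same examples it was tuned on, so only `test_accuracy` says something about matches it has not seen.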
Exercise 3: Data Characteristics
Briefly describe and provide an example for each of the following concepts:
1. Feature or attribute
A feature or attribute describes a characteristic of an object. For example, the amount of rain (mm) is an
attribute of a match day.
2. Instance
An instance is a single example in the data set, described by its attribute values and belonging to a
class. If we take the class footballclubsInEurope, then the football club Ajax is an instance of it.
Likewise, if we take the class of students in the Data Mining course, then I am an instance of this
class.
3. Class
A class is a group or category to which instances belong and whose members share the same kind of
properties. In data mining it is often the value we want to predict for a new instance. If we take
all footballclubsInEurope, then Ajax and Real Madrid are instances of this group.
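In a data mining data set, the three concepts usually appear together in one table: each row is an instance, each column is a feature, and one column holds the class. The sketch below illustrates this with invented values; the column names and the choice of "outcome" as the class column are assumptions made for the example.

```python
# Each dictionary is one instance; its keys are the features.
# "outcome" holds the class. All values are invented for illustration.
instances = [
    {"weather": "sunny", "rain_mm": 0.0, "artificial_grass": False, "outcome": "played"},
    {"weather": "rainy", "rain_mm": 24.0, "artificial_grass": False, "outcome": "cancelled"},
    {"weather": "rainy", "rain_mm": 17.0, "artificial_grass": True, "outcome": "played"},
]

CLASS_ATTRIBUTE = "outcome"

# The features are all columns except the one that holds the class.
feature_names = [name for name in instances[0] if name != CLASS_ATTRIBUTE]

# The classes are the distinct values found in the class column.
classes = sorted({row[CLASS_ATTRIBUTE] for row in instances})
```

Here the data set has three instances, three features and two classes, "cancelled" and "played".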