Introduction to Data Mining
Why use Data Mining?
Lecturer: Abdullahi Ahamad Shehu
(M.Sc. Data Science, M.Sc. Computer Science)
Office: Faculty of Computing Extension
Examples of Concepts
• How to decide whether there is an attempt to intrude into the network
• How to determine whether somebody has a specific illness
• How to decide whether there is credit card misuse
• How to conclude whether a contract is good or not
• How to predict computer performance
• How to determine which products people buy together
• Which groupings can be established from a set of examples
Types of Concepts
• Classification
• learn to classify unclassified examples from classified ones
• e.g. how to decide whether to give a loan
• Association learning
• learn associations between attributes
• e.g. what supermarket products people buy together
• Clustering
• group examples together
• e.g. given a set of documents, divide them into groups
• Numeric prediction
• the output to be learned is numeric
• e.g. calculate the price of a car
Concept Description
Outlook  Temp  Humidity  Windy  Play?
Sunny    Hot   High      No     No
Sunny    Hot   High      Yes    No
Cloudy   Hot   High      No     Yes
Rainy    Mild  Normal    No     Yes
If outlook = sunny and humidity = high
then play = no
If outlook = rainy and windy = yes
then play = no
If outlook = cloudy
then play = yes
If humidity = normal
then play = yes
If none of the above rules applies
then play = yes
Concept description:
output of our data mining tool
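As a minimal sketch (not from the slides), the concept description above can be written directly in R, the tool this module refers to later; the attribute names follow the weather table.

# the rule set above as an R function; rule order matches the slide
play <- function(outlook, temp, humidity, windy) {
  if (outlook == "sunny" && humidity == "high") return("no")
  if (outlook == "rainy" && windy == "yes") return("no")
  if (outlook == "cloudy") return("yes")
  if (humidity == "normal") return("yes")
  "yes"  # default: none of the above rules applies
}
play("sunny", "hot", "high", "no")     # "no", matching row 1 of the table
play("rainy", "mild", "normal", "no")  # "yes", matching row 4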
Instances (or Examples)
Outlook  Temp  Humidity  Windy  Play?
Sunny    Hot   High      No     No
Sunny    Hot   High      Yes    No
Cloudy   Hot   High      No     Yes
Rainy    Mild  Normal    No     Yes
• Instance: a single example of a concept
• described by a set of attributes (features, columns)
• Input to learning algorithm
• set of instances
• usually described as a single relation/flat file.
(each row in the table above is an instance)
Attributes (or features)
• Attribute: describes a specific characteristic of an instance
• e.g. age, salary, …
• Attributes are often predefined for a set of instances
• an instance is described by its attribute values
• e.g. 25, 20567, …
Outlook  Temp  Humidity  Windy  Play?
Sunny    Hot   High      No     No
Sunny    Hot   High      Yes    No
Cloudy   Hot   High      No     Yes
Rainy    Mild  Normal    No     Yes
(each column is an attribute; each cell is an attribute value)
Problems with Attributes
• Not all instances have values for all attributes
• e.g. patient’s family history is unknown
• Existence of an attribute may depend on value of another attribute.
• e.g. attribute pregnant conditional on gender = female
• Not all attributes are important
• E.g. person’s nose shape vs. whether to give them a loan
• Need feature selection to identify the important ones
Types of Attribute
• Nominal
• values are symbolic, e.g. desk, table, bed, wardrobe
• no relation between nominal values
• Boolean attributes are a special case
• 0 and 1 or True and False
• also called categorical, enumerated or discrete
• Ordinal
• values are ordered, e.g. small, medium, large, x-large
• small < medium < large < x-large
• but difference between 2 values is not meaningful
Types of Attribute
• Interval
• quantities are ordered
• measured in fixed, equal units, e.g. years 2001, 2002, 2003, 2004
• difference between values meaningful: 2005 - 2004
• but sum or product is not meaningful: 2005 + 2004
• Ratio
• Quantities include a natural zero
• Money: 0, 10, 100, 1000
• treated as real numbers because all mathematical operations
are meaningful
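A small illustrative sketch (a tooling assumption, using base R; values taken from the examples above) of how the four attribute types map onto R data types:

furniture <- factor(c("desk", "table", "bed", "wardrobe"))   # nominal: symbolic, unordered
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large", "x-large"),
               ordered = TRUE)                               # ordinal: ordered levels
size[1] < size[2]                  # TRUE: order is meaningful, differences are not
year <- c(2001, 2002, 2003, 2004)  # interval: differences meaningful, sums are not
money <- c(0, 10, 100, 1000)       # ratio: natural zero, all operations meaningful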
Preparing the Input
• Need to obtain a dataset in the ‘correct’ format
• Possible when there is a limited set of finite relations
outlook  temp  humidity  windy  play?
sunny    85    85        no     no
sunny    80    90        yes    no
cloudy   83    86        no     yes
rainy    70    96        no     yes
(rows: instances; columns: attributes, with play? as the class; cells: attribute values)
Preparing the Input
• For example, create data in Excel
• save as .csv file
• Values of attributes
• Temperature, Humidity: numeric
• Outlook: Sunny, Cloudy, Rainy
• Windy, Play?: Yes, No
outlook  temp  humidity  windy  play?
sunny    85    85        no     no
sunny    80    90        yes    no
cloudy   83    86        no     yes
rainy    70    96        no     yes
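A minimal sketch of loading such a file in R (assuming the file was saved as weather.csv):

# read the .csv created in Excel; character columns become nominal attributes (factors)
weather <- read.csv("weather.csv", stringsAsFactors = TRUE)
str(weather)       # check each attribute's type: factors vs numeric
summary(weather)   # quick overview of the attribute values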
Problems Preparing the Input
• Data may come from different sources
• e.g. different departments within a company
• Variation in record keeping
• Style
• Data aggregation (hourly, weekly, monthly etc.)
• Synonyms
• Errors
• Data must be assembled, integrated, aggregated and
cleaned
Preparing data
• Wrangling: transforming data into another format to make it more
suitable and valuable for a task
• Cleansing (cleaning): detecting and correcting errors in the data
• Scraping: automatic extraction of data from a data source
• Integration: combining data from several disparate sources into a
(useful) dataset
Missing data
• Missing data may be unknown, unrecorded, irrelevant
• Causes
• Equipment faults
• Difficult to acquire (e.g. age, income)
• Measurement is not possible
• The fact that a value is missing may be informative
• e.g. missing test in medical examination
• BUT this is NOT usually the case
• Represented in R as NA
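A short sketch of how missing values behave in R (the vector name is hypothetical):

age <- c(25, NA, 47, 31)   # NA marks an unknown or unrecorded value
is.na(age)                 # FALSE TRUE FALSE FALSE: detect missing entries
mean(age)                  # NA: most calculations propagate missing values
mean(age, na.rm = TRUE)    # 34.33: explicitly ignore them instead
na.omit(age)               # drop the missing entries altogether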
Inaccurate Values
• Errors and omissions which do not affect original purpose of data collection
• e.g. age of bank customers not important
• e.g. customer IDs not important
• Typographical errors in nominal attributes
• e.g. Pepsi vs Pepsi-cola
• Deliberate errors
• E.g. People may lie about their mental health history
• Duplicates
• ML algorithms very sensitive to this
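A minimal sketch of detecting exact duplicates in R (hypothetical data):

drinks <- c("Pepsi", "Pepsi-cola", "Pepsi", "Coke")
duplicated(drinks)   # FALSE FALSE TRUE FALSE: only exact repeats are flagged
unique(drinks)       # "Pepsi" "Pepsi-cola" "Coke"
# note: "Pepsi" vs "Pepsi-cola" is NOT caught; typographical variants
# in nominal attributes need manual cleaning or fuzzy matching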
Summary
• Preparing data for input is difficult and demanding
• data may need assembling, integrating, aggregating and cleaning
• if data set is huge, a sample may be used
• need relation, attributes and data (or instances)
• Various types of data may be used
• Nominal and numeric are most common
Data Mining: Output
• Output: requirements
• Types of output
• Tables
• Rules
• Trees
• Instances
• Clusters
• Summary
Understanding the Output
• Output must be easy to understand
• representation of output is key
• Representation is NOT independent of the learning process
• learning algorithm used determines representation of output
• depends on type of algorithm
Representations
• Decision tables
• Classification rules
• Association rules
• Decision trees
• Regression Trees for numeric prediction
• Instance-based representation
• Clusters
Data Mining: Output
• Output: requirements
• Types of output
• Tables
• Rules
• Trees
• Instances
• Clusters
• Summary
Decision tables
• Uses the same format as the input but only uses selected attributes
• Challenge: choosing which attributes to select
• Decisions
• if outlook = sunny
and humidity = high
then play = no
• etc
outlook  humidity  play
sunny    high      no
sunny    normal    yes
cloudy   high      yes
…
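A minimal sketch of a decision table in R, assuming a weather data frame with nominal humidity values as in the table above:

# keep only the selected attributes plus the class
dtable <- unique(weather[, c("outlook", "humidity", "play")])
# classify a new instance by looking up its attribute values
new <- data.frame(outlook = "sunny", humidity = "high")
merge(new, dtable)   # returns the matching row(s), including play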
Classification Rules
• IF conditions THEN conclusion
• if outlook = sunny
and humidity > 83
then play = no
• Conditions
• tests that have to be true for the rule to apply
• usually several conditions connected by ANDs
• Conclusion
• solution to the problem
• class, set of classes or probability distribution
Classification Rules: problems
• Rules may contradict each other
• Two applicable rules may give different classifications
• Decision List is an ordered set of rules
• first satisfied rule should be applied
• rule only applied if preceding ones are not applicable
• no contradictions in classification
• Rules may fail to classify an instance
• Decision List may have a final classification rule with no
conditions
• no classification failures
If outlook=sunny & humidity=high then play=no
If outlook=rainy & windy=true then play=no
...
If none of the above then play=yes
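A minimal sketch of a decision list in R (names are illustrative): rules are kept in order, and the first rule whose conditions are satisfied supplies the classification.

# each rule: a condition (a function of the instance) and a conclusion
rules <- list(
  list(cond = function(x) x$outlook == "sunny" && x$humidity == "high", class = "no"),
  list(cond = function(x) x$outlook == "rainy" && x$windy == "true", class = "no"),
  list(cond = function(x) TRUE, class = "yes")   # final rule with no conditions
)
classify <- function(x) {
  for (r in rules) if (r$cond(x)) return(r$class)   # first satisfied rule applies
}
classify(list(outlook = "sunny", humidity = "high", windy = "false"))   # "no"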
Association Rules
• Like classification rules BUT
• used to infer the value of any attribute (not just class)
• or a combination of attributes
• NOT intended to be used together as a set
• different association rules determine different things
• Problem: many different association rules can be derived from a small
dataset
• Restrict to associations with
• high support: number of instances it predicts correctly
• high confidence (accuracy): proportion of instances it predicts correctly out of all
instances it applies to
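As a hedged sketch, association rules with minimum support and confidence can be mined in R with the arules package (a tooling assumption; the slides do not name a package). basket_data here is a hypothetical list of purchases:

library(arules)                            # assumes install.packages("arules")
basket_data <- list(c("beer", "nappy"),
                    c("beer", "nappy", "crisps"),
                    c("bread", "crisps"))
trans <- as(basket_data, "transactions")   # one row per shopping basket
rules <- apriori(trans, parameter = list(supp = 0.3, conf = 0.8))
inspect(sort(rules, by = "confidence"))    # keep only high support and confidence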
Association Rules
• Association rules
• If beer = yes and crisps = no then nappy = yes
• If beer = yes then nappy = yes and bread = no
Association Rules: Examples
• The rule
• If windy = false and play = no
then outlook = sunny and humidity = high
is different from the pair of rules
• If windy = false and play = no then outlook = sunny
• If windy = false and play = no then humidity = high
because their coverage and accuracy differ
Data Mining: Output
• Output: requirements
• Types of output
• Tables
• Rules
• Trees
• Instances
• Clusters
• Summary
Decision Trees
• Nodes represent attributes
• Each branch from a node usually represents a single value for that attribute
• but it can compare two values for the attribute
• use a function of one or more attributes
• Each leaf node contains answer to problem
• class
• set of classes or probability distribution
• To solve a problem
• new instance is routed down the tree to find solution
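A minimal sketch of learning and using such a tree in R with the rpart package (a tooling assumption; weather is the data frame from earlier):

library(rpart)
# grow a classification tree: nodes test attributes, leaves hold classes
tree <- rpart(play ~ outlook + temp + humidity + windy,
              data = weather, method = "class")
print(tree)
# route a new instance down the tree to find its class
predict(tree, newdata = weather[1, ], type = "class")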
Decision Nodes
• Nominal attribute
• number of branches out of a node is equal to number of attribute
values
• attribute tested at most once on a path
• Numeric attribute
• attribute value is compared (> or <) to a constant.
• or three way split may be used (i.e. 3 branches)
• <, =, > for integers
• below, within, above for real (test against an interval)
• attribute may be tested several times on a path
Decision Tree Example

1st year inc
  <= 2.5: bad
  > 2.5: Statutory holidays
    > 10: good
    <= 10: 1st year inc
      <= 4: bad
      > 4: good

[The original figure also shows a second tree testing Hours/week (<= 36 vs > 36) and Health plan (none / half / full), with good/bad leaves.]
Converting Trees to Rules
• Decision trees can be converted into a set of n rules
• n is the number of leaf nodes
• One rule for each path from the root to a leaf
• Conditions: one per node from root to leaf
• Conclusion(s): class(es) assigned by the leaf
• Rules obtained from a decision tree are unambiguous and complete
• no classification contradictions
• rules are order-independent
• no classification failures
• BUT rules may be unnecessarily complex
• rule pruning required to remove redundant conditions
Trees to Rules: Example
• if 1st year inc <= 2.5
then bad
• if 1st year inc > 2.5
and stat. holidays > 10
then good
• if 1st year inc > 2.5
and stat. holidays <= 10
and 1st year inc <= 4
then bad
• if 1st year inc > 2.5
and stat. holidays <= 10
and 1st year inc > 4
then good

[Figure: the decision tree from the previous slide, with one rule per root-to-leaf path.]
Rules to Trees: Example
• If a and b then x
• If c and d then x

[Figure: the equivalent decision tree, with yes/no branches on a, b, c and d and leaves x; the subtree testing c and d must be replicated under both branches of a (the replicated subtree problem).]
Trees for Numeric Prediction
• Predicting a numeric value not a class
• Regression computes an expression which
calculates a numeric value
• PRP = -55.9 + 0.0489 cycle time + 0.0153 min memory
+ 0.0056 max memory + 0.641 cache
- 0.27 min channels + 1.48 max channels
• Regression tree
• decision tree where each leaf predicts a numeric value
• value is average of training instances that reach leaf
• Model tree
• regression tree with linear regression model at leaf
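A hedged sketch in R, again assuming the rpart package: method = "anova" grows a regression tree, where each leaf predicts the average of the training instances that reach it. cpu is a hypothetical data frame of computer characteristics with numeric target PRP:

library(rpart)
rtree <- rpart(PRP ~ ., data = cpu, method = "anova")   # regression tree
predict(rtree, newdata = cpu[1, ])                      # numeric prediction, not a class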
Data Mining: Output
• Output: requirements
• Types of output
• Tables
• Trees
• Rules
• Instances
• Clusters
• Summary
Instance-Based Representation
• Simplest form of learning
• Look for instance most similar to new instance
• Lazy Learning
• work is done when problem-solving, not training
• Distance function
• Numeric calculation indicates similarity between two attribute values
• Numeric attributes: difference in values
• Nominal attributes
• 0 if equal, 1 if not
• Or more sophisticated measure (e.g. hue for colours)
[Figure: representing clusters: a table of per-instance cluster membership probabilities and a dendrogram over the instances g a c i e d k b j f h]

instance  1    2    3
a         0.4  0.1  0.5
b         0.1  0.8  0.1
c         0.3  0.3  0.4
d         0.1  0.1  0.8
e         0.4  0.2  0.4
…
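A minimal sketch of the instance-based idea in R (data values are hypothetical): all the work happens at problem-solving time, when the distance function finds the most similar stored instance.

# stored training instances (two numeric attributes) and their classes
train <- rbind(c(2.1, 0.5), c(4.0, 1.2), c(3.3, 0.9))
classes <- c("no", "yes", "yes")
nearest <- function(x) {
  # distance function: Euclidean distance to every stored instance
  d <- sqrt(rowSums((train - matrix(x, nrow(train), 2, byrow = TRUE))^2))
  classes[which.min(d)]   # answer with the class of the most similar instance
}
nearest(c(3.9, 1.0))   # "yes": closest to the second stored instance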
Summary
• There are lots of different ways of representing the output from a
data mining tool
• Trees – decision trees, regression trees
• Rules – classification, association
• Instances – decision table, nearest neighbour
• Clusters
• Output depends on learning algorithm and input