Introduction to Data Mining
Why use Data Mining?
Lecturer: Abdullahi Ahamad Shehu
(M.Sc. Data Science, M.Sc. Computer Science)
Office: Faculty of Computing Extension
10 March 2025
Content
• Input
• Output
Examples of Concepts
• How to decide whether there is an attempt to intrude into a network
• How to determine whether somebody has a specific illness
• How to decide whether there is credit card misuse
• How to conclude whether a contract is good or not
• How to predict computer performance
• How to determine which products people buy together
• Which groupings can be established from a set of examples
Types of Concepts
• Classification
• learn to classify unclassified examples from classified ones
• e.g. how to decide whether to give a loan
• Association learning
• learn associations between attributes
• e.g. what supermarket products people buy together
• Clustering
• group examples together
• e.g. given a set of documents, divide them into groups
• Numeric prediction
• the output to be learned is numeric
• e.g. calculate the price of a car
Concept Description
Outlook Temp Humidity Windy Play?
Sunny Hot High No No
Sunny Hot High Yes No
Cloudy Hot High No Yes
Rainy Mild Normal No Yes
If outlook = sunny and humidity = high
then play = no
If outlook = rainy and windy = yes
then play = no
If outlook = cloudy
then play = yes
If humidity = normal
then play = yes
If none of the above rules applies
then play = yes
Concept description:
output of our data mining tool
Instances (or Examples)
Outlook Temp Humidity Windy Play?
Sunny Hot High No No
Sunny Hot High Yes No
Cloudy Hot High No Yes
Rainy Mild Normal No Yes
• Instance: a single example of a concept
• described by a set of attributes (features, columns)
• Input to learning algorithm
• set of instances
• usually described as a single relation/flat file.
(the rows of the table are the instances)
Attributes (or features)
• Attribute: describes a specific characteristic of an instance
• e.g. age, salary, …
• Attributes are often predefined for a set of instances
• an instance is described by its attribute values
• e.g. 25, 20567, …
Outlook Temp Humidity Windy Play?
Sunny Hot High No No
Sunny Hot High Yes No
Cloudy Hot High No Yes
Rainy Mild Normal No Yes
(each column is an attribute; each cell holds an attribute value)
Problems with Attributes
• Not all instances have values for all attributes
• e.g. patient’s family history is unknown
• Existence of an attribute may depend on value of another attribute.
• e.g. attribute pregnant conditional on gender = female
• Not all attributes are important
• E.g. person’s nose shape vs. whether to give them a loan
• Need feature selection to identify the important ones
Types of Attribute
• Nominal
• values are symbolic, e.g. desk, table, bed, wardrobe
• no relation between nominal values
• Boolean attributes are a special case
• 0 and 1 or True and False
• also called categorical, enumerated or discrete
• Ordinal
• values are ordered, e.g. small, medium, large, x-large
• small < medium < large < x-large
• but difference between 2 values is not meaningful
Types of Attribute
• Interval
• quantities are ordered
• measured in fixed, equal units, e.g. years 2001, 2002, 2003, 2004
• difference between values meaningful: 2005 - 2004
• but sum or product is not meaningful: 2005 + 2004
• Ratio
• Quantities include a natural zero
• Money: 0, 10, 100, 1000
• treated as real numbers because all mathematical operations
are meaningful
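
As a brief illustration (values invented for this example), R distinguishes nominal from ordinal attributes via plain and ordered factors:

  # Nominal attribute: unordered symbolic values
  furniture <- factor(c("desk", "table", "bed", "wardrobe"))
  # Ordinal attribute: ordered levels, but differences between levels are not meaningful
  size <- factor(c("small", "large", "medium"),
                 levels = c("small", "medium", "large", "x-large"),
                 ordered = TRUE)
  size < "large"   # TRUE FALSE TRUE: order comparisons are meaningful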
Preparing the Input
• Need to obtain a dataset in the ‘correct format’
• possible when the data forms a limited set of finite relations (which can be flattened into a single table)
outlook temp humidity windy play?
sunny 85 85 no no
sunny 80 90 yes no
cloudy 83 86 no yes
rainy 70 96 no yes
(rows are instances; columns are attributes, with the last column as the class; cells hold attribute values)
Preparing the Input
• For example, create data in Excel
• save as .csv file
• Values of attributes
• Temperature, Humidity: numeric
• Outlook: Sunny, Cloudy, Rainy
• Windy, Play?: Yes, No
outlook temp humidity windy play?
sunny 85 85 no no
sunny 80 90 yes no
cloudy 83 86 no yes
rainy 70 96 no yes
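
A minimal sketch of reading such a file into R (the file name weather.csv is an assumption for illustration):

  # Read the CSV exported from Excel; symbolic attributes become factors
  weather <- read.csv("weather.csv", stringsAsFactors = TRUE)
  str(weather)       # check attribute types: outlook/windy/play as factors, temp/humidity numeric
  summary(weather)   # quick sanity check of attribute values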
Problems Preparing the Input
• Data may come from different sources
• e.g. different departments within a company
• Variation in record keeping
• Style
• Data aggregation (hourly, weekly, monthly etc.)
• Synonyms
• Errors
• Data must be assembled, integrated, aggregated and
cleaned
Preparing data
• Wrangling - transforming data into another format to make it more
suitable and valuable for a task
• Cleansing (cleaning) - detecting and correcting errors in the data.
• Scraping - automatic extraction of data from a data source.
• Integration - combining data from several disparate sources into a
(useful) dataset
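
A small R illustration of the integration step, merging two invented sources on a shared key:

  # Two departments keep separate records sharing a customer id
  customers <- data.frame(id = 1:3, age = c(25, 41, 33))
  accounts  <- data.frame(id = c(1, 3), balance = c(2000, 150))
  merge(customers, accounts, by = "id", all.x = TRUE)   # unmatched rows get NA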
Missing data
• Missing data may be unknown, unrecorded, irrelevant
• Causes
• Equipment faults
• Difficult to acquire (e.g. age, income)
• Measurement is not possible
• The fact that a value is missing may be informative
• e.g. missing test in medical examination
• BUT this is NOT usually the case
• Represented in R as NA
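
A short R sketch of how NA behaves (illustrative values):

  age <- c(25, NA, 40, 33)
  is.na(age)                # TRUE where the value is missing
  mean(age)                 # NA: missing values propagate by default
  mean(age, na.rm = TRUE)   # 32.67: ignore missing values explicitly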
Inaccurate Values
• Errors and omissions which do not affect the original purpose of the data collection
• e.g. age of bank customers not important
• e.g. customer IDs not important
• Typographical errors in nominal attributes
• e.g. Pepsi vs Pepsi-cola
• Deliberate errors
• E.g. People may lie about their mental health history
• Duplicates
• many ML algorithms are very sensitive to duplicates
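
A minimal R sketch for finding exact duplicates, assuming the weather data frame loaded earlier:

  dups <- duplicated(weather)     # TRUE for instances that repeat an earlier row
  sum(dups)                       # how many duplicate instances there are
  weather <- weather[!dups, ]     # keep only the first occurrence of each instance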
Summary
• Preparing data for input is difficult and demanding
• data may need assembling, integrating, aggregating and cleaning
• if data set is huge, a sample may be used
• need relation, attributes and data (or instances)
• Various types of data may be used
• Nominal and numeric are most common
Content
• Input
• Output
Data Mining: Output
• Output: requirements
• Types of output
• Tables
• Rules
• Trees
• Instances
• Clusters
• Summary
Understanding the Output
• Output must be easy to understand
• representation of output is key
• Representation is NOT independent from learning process
• learning algorithm used determines representation of output
• depends on type of algorithm
Representations
• Decision tables
• Classification rules
• Association rules
• Decision trees
• Regression Trees for numeric prediction
• Instance-based representation
• Clusters
Data Mining: Output
• Output: requirements
• Types of output
• Tables
• Rules
• Trees
• Instances
• Clusters
• Summary
Decision tables
• Uses the same format as the input but includes only selected attributes
• Challenge: choosing which attributes to include
• Decisions
• if outlook = sunny
and humidity = high
then play = no
• etc
outlook humidity play
sunny high no
sunny normal yes
cloudy high yes
…
Classification Rules
• IF conditions THEN conclusion
• if outlook = sunny
and humidity > 83
then play = no
• Conditions
• tests that have to be true for the rule to apply
• usually several conditions connected by ANDs
• Conclusion
• solution to the problem
• class, set of classes or probability distribution
Classification Rules: problems
• Rules may contradict each other
• Two applicable rules may give different classifications
• Decision List is an ordered set of rules
• first satisfied rule should be applied
• rule only applied if preceding ones are not applicable
• no contradictions in classification
• Rules may fail to classify an instance
• Decision List may have a final classification rule with no
conditions
• no classification failures
If outlook=sunny & humidity=high then play=no
If outlook=rainy & windy=true then play=no
...
If none of the above then play=yes
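
A minimal R sketch of this decision list as ordered tests (classify_play is a hypothetical name):

  classify_play <- function(outlook, humidity, windy) {
    if (outlook == "sunny" && humidity == "high") return("no")
    if (outlook == "rainy" && windy == "yes")     return("no")
    # ... further rules, tried in order ...
    "yes"   # final rule with no conditions: no classification failures
  }
  classify_play("sunny", "high", "no")   # "no": the first rule fires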
Association Rules
• Like classification rules BUT
• used to infer the value of any attribute (not just class)
• or a combination of attributes
• NOT intended to be used together as a set
• different association rules determine different things
• Problem: many different association rules can be derived from a small
dataset
• Restrict to associations with
• high support: number of instances it predicts correctly
• high confidence (accuracy): proportion of instances it predicts correctly out of all
instances it applies to
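
For illustration, rules meeting support and confidence thresholds can be mined in R with the arules package; the transactions object trans is assumed to exist:

  library(arules)   # provides the Apriori algorithm
  # trans: market-basket data, e.g. built with read.transactions()
  rules <- apriori(trans, parameter = list(support = 0.1, confidence = 0.8))
  inspect(head(sort(rules, by = "confidence")))   # strongest associations first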
Association Rules
• Association rule
• If beer = yes and crisps = no then nappy = yes
• If beer = yes then nappy = yes and bread = no
Association Rules: Examples
• The rule
• If windy = false and play = no
then outlook = sunny and humidity = high
Different from the two separate rules
• If windy = false and play = no then outlook = sunny
• If windy = false and play = no then humidity = high
because the combined rule and the separate rules have different coverage and accuracy
Data Mining: Output
• Output: requirements
• Types of output
• Tables
• Rules
• Trees
• Instances
• Clusters
• Summary
Decision Trees
• Nodes represent attributes
• Each branch from a node usually represents a single value for that attribute
• but it can also compare the values of two attributes
• or use a function of one or more attributes
• Each leaf node contains answer to problem
• class
• set of classes or probability distribution
• To solve a problem
• new instance is routed down the tree to find solution
Decision Nodes
• Nominal attribute
• number of branches out of a node is equal to number of attribute
values
• attribute tested at most once on a path
• Numeric attribute
• attribute value is compared (> or <) to a constant
• or a three-way split may be used (i.e. 3 branches)
• <, =, > for integers
• below, within, above for reals (test against an interval)
• attribute may be tested several times on a path
Decision Tree Example
1st year inc
  <= 2.5: bad
  > 2.5: Statutory holidays
    > 10: good
    <= 10: 1st year inc
      <= 4: bad
      > 4: good

[The slide also shows a second example tree testing Hours/week (<= 36 / > 36) and Health plan (none / half / full), with good/bad leaves.]
Converting Trees to Rules
• Decision trees can be converted into a set of n rules
• n is the number of leaf nodes
• One rule for each path from the root to a leaf
• Conditions: one per node from root to leaf
• Conclusion(s): class(es) assigned by the leaf
• Rules obtained from decision tree are unambiguous and complete
• no classification contradictions
• rules are order-independent
• no classification failures
• BUT rules may be unnecessarily complex
• rule pruning required to remove redundant conditions
Trees to Rules: Example
• if 1st year inc <= 2.5 then bad
• if 1st year inc > 2.5 and statutory holidays > 10 then good
• if 1st year inc > 2.5 and statutory holidays <= 10 and 1st year inc <= 4 then bad
• if 1st year inc > 2.5 and statutory holidays <= 10 and 1st year inc > 4 then good

The tree these rules come from:

1st year inc
  <= 2.5: bad
  > 2.5: Statutory holidays
    > 10: good
    <= 10: 1st year inc
      <= 4: bad
      > 4: good
Rules to Trees: Example
• If a and b then x
• If c and d then x

An equivalent tree (y = test succeeds, n = test fails):

a
  y: b
    y: x
    n: c
      y: d
        y: x
        n: (not x)
      n: (not x)
  n: c
    y: d
      y: x
      n: (not x)
    n: (not x)

Note that the subtree testing c and d appears twice: expressing these rules as a single tree forces subtrees to be replicated.
Trees for Numeric Prediction
• Predicting a numeric value not a class
• Regression computes an expression which
calculates a numeric value
• PRP = - 55.9 + 0.0489 cycle time + 0.0153 min memory
+ 0.0056 max memory + 0.641 cache
- 0.27 min channels + 1.48 max channels
• Regression tree
• decision tree where each leaf predicts a numeric value
• value is average of training instances that reach leaf
• Model tree
• regression tree with linear regression model at leaf
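
As a sketch, a regression tree can be grown in R with the rpart package; the cpu data frame (PRP plus the hardware attributes) is assumed:

  library(rpart)   # recursive partitioning trees
  reg_tree <- rpart(PRP ~ ., data = cpu, method = "anova")   # "anova" grows a regression tree
  predict(reg_tree, cpu[1, ])   # leaf prediction: mean PRP of training instances reaching it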
Regression Tree
[Figure: a regression tree for the CPU performance data. Internal nodes test chmin, mmax, mmin, chmax, myct and cach against constants or intervals; each leaf predicts a numeric PRP value, ranging from 18.3 to 783.]
Model Tree
• Combines linear regression and regression tree
[Figure: a model tree with tests on chmin, mmax and cach; each leaf holds a linear regression model LM1 to LM6.]

LM1: PRP = 8.29 + 0.004 mmax + 2.77 chmin
LM2: PRP = 20.3 + 0.004 mmin - 3.99 chmin
etc.
Data Mining: Output
• Output: requirements
• Types of output
• Tables
• Rules
• Trees
• Instances
• Clusters
• Summary
Instance-Based Representation
• Simplest form of learning
• Look for instance most similar to new instance
• Lazy Learning
• work is done when problem-solving, not training
• Distance function
• Numeric calculation indicates similarity between two attribute values
• Numeric attributes: difference in values
• Nominal attributes
• 0 if equal, 1 if not
• Or more sophisticated measure (e.g. hue for colours)
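
A minimal sketch of such a distance function in R, combining numeric differences with 0/1 nominal mismatches (dist_mixed is a hypothetical helper):

  dist_mixed <- function(x, y) {          # x, y: two instances given as lists
    d <- mapply(function(a, b) {
      if (is.numeric(a)) abs(a - b)       # numeric attribute: difference in values
      else as.numeric(a != b)             # nominal attribute: 0 if equal, 1 if not
    }, x, y)
    sqrt(sum(d^2))                        # combine per-attribute distances (Euclidean)
  }
  dist_mixed(list(25, "sunny"), list(30, "rainy"))   # sqrt(5^2 + 1^2)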
Instance-Based
[Figure: instances plotted in attribute space, with a new unclassified problem. The single closest instance is class "blue"; a 3-nearest-neighbour vote gives "yellow".]
Clusters
• Represent groups of instances which are similar
• Some allow overlapping clusters
Probabilistic (overlapping) cluster membership:

instance   cluster 1   cluster 2   cluster 3
a          0.4         0.1         0.5
b          0.1         0.8         0.1
c          0.3         0.3         0.4
d          0.1         0.1         0.8
e          0.4         0.2         0.4
…

[Figure: a dendrogram showing a hierarchical clustering of instances a to k.]
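
For illustration, hard (non-overlapping) clusters can be found with R's built-in kmeans (synthetic data here); soft memberships like the table above need a fuzzy method such as e1071::cmeans:

  set.seed(1)
  pts <- data.frame(x = rnorm(20), y = rnorm(20))   # 20 synthetic instances
  km <- kmeans(pts, centers = 3)   # partition into 3 hard clusters
  km$cluster                       # cluster label for each instance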
Summary
• There are lots of different ways of representing the output from a
data mining tool
• Trees – decision trees, regression trees
• Rules – classification, association
• Instances – decision table, nearest neighbour
• Clusters
• Output depends on learning algorithm and input