Introduction to Data Mining
Why use Data Mining?
Lecturer: Abdullahi Ahamad Shehu
(M.Sc. Data Science, M.Sc. Computer Science)
Office: Faculty of Computing Extension
Examples of Concepts
• How to decide whether there is an attempt to intrude into the network
• How to determine whether somebody has a specific illness
• How to decide whether there is credit card misuse
• How to conclude whether a contract is good or not
• How to predict computer performance
• How to determine which products people buy together
• Which groupings can be established from a set of examples
Types of Concepts
• Classification
• learn to classify unclassified examples from classified ones
• e.g. how to decide whether to give a loan
• Association learning
• learn associations between attributes
• e.g. what supermarket products people buy together
• Clustering
• group examples together
• e.g. given a set of documents, divide them into groups
• Numeric prediction
• the output to be learned is numeric
• e.g. calculate the price of a car
Concept Description
Outlook  Temp  Humidity  Windy  Play?
Sunny    Hot   High      No     No
Sunny    Hot   High      Yes    No
Cloudy   Hot   High      No     Yes
Rainy    Mild  Normal    No     Yes
If outlook = sunny and humidity = high
then play = no
If outlook = rainy and windy = yes
then play = no
If outlook = cloudy
then play = yes
If humidity = normal
then play = yes
If none of the above rules applies
then play = yes
Concept description:
output of our data mining tool
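As a minimal sketch (not from the slides), the concept description above can be written directly in R, the tool this module refers to later; the attribute names follow the weather table.

# the rule set above as an R function; rule order matches the slide
play <- function(outlook, temp, humidity, windy) {
  if (outlook == "sunny" && humidity == "high") return("no")
  if (outlook == "rainy" && windy == "yes") return("no")
  if (outlook == "cloudy") return("yes")
  if (humidity == "normal") return("yes")
  "yes"  # default: none of the above rules applies
}
play("sunny", "hot", "high", "no")     # "no", matching row 1 of the table
play("rainy", "mild", "normal", "no")  # "yes", matching row 4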
Instances (or Examples)
Outlook  Temp  Humidity  Windy  Play?
Sunny    Hot   High      No     No
Sunny    Hot   High      Yes    No
Cloudy   Hot   High      No     Yes
Rainy    Mild  Normal    No     Yes
• Instance: a single example of a concept
• described by a set of attributes (features, columns)
• Input to learning algorithm
• set of instances
• usually described as a single relation/flat file.
(each row in the table above is an instance)
Attributes (or features)
• Attribute: describes a specific characteristic of an instance
• e.g. age, salary, …
• Attributes are often predefined for a set of instances
• an instance is described by its attribute values
• e.g. 25, 20567, …
Outlook  Temp  Humidity  Windy  Play?
Sunny    Hot   High      No     No
Sunny    Hot   High      Yes    No
Cloudy   Hot   High      No     Yes
Rainy    Mild  Normal    No     Yes
(each column is an attribute; each cell is an attribute value)
Problems with Attributes
• Not all instances have values for all attributes
• e.g. patient’s family history is unknown
• Existence of an attribute may depend on value of another attribute.
• e.g. attribute pregnant conditional on gender = female
• Not all attributes are important
• E.g. person’s nose shape vs. whether to give them a loan
• Need feature selection to identify the important ones
Types of Attribute
• Nominal
• values are symbolic, e.g. desk, table, bed, wardrobe
• no relation between nominal values
• Boolean attributes are a special case
• 0 and 1 or True and False
• also called categorical, enumerated or discrete
• Ordinal
• values are ordered, e.g. small, medium, large, x-large
• small < medium < large < x-large
• but difference between 2 values is not meaningful
Types of Attribute
• Interval
• quantities are ordered
• measured in fixed, equal units, e.g. years 2001, 2002, 2003, 2004
• difference between values meaningful: 2005 - 2004
• but sum or product is not meaningful: 2005 + 2004
• Ratio
• Quantities include a natural zero
• Money: 0, 10, 100, 1000
• treated as real numbers because all mathematical operations
are meaningful
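A small illustrative sketch (a tooling assumption, using base R; values taken from the examples above) of how the four attribute types map onto R data types:

furniture <- factor(c("desk", "table", "bed", "wardrobe"))   # nominal: symbolic, unordered
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large", "x-large"),
               ordered = TRUE)                               # ordinal: ordered levels
size[1] < size[2]                  # TRUE: order is meaningful, differences are not
year <- c(2001, 2002, 2003, 2004)  # interval: differences meaningful, sums are not
money <- c(0, 10, 100, 1000)       # ratio: natural zero, all operations meaningful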
Preparing the Input
• Need to obtain a dataset in the ‘correct’ format
• Possible when there is a limited set of finite relations
outlook  temp  humidity  windy  play?
sunny    85    85        no     no
sunny    80    90        yes    no
cloudy   83    86        no     yes
rainy    70    96        no     yes
(rows: instances; columns: attributes, with play? as the class; cells: attribute values)
Preparing the Input
• For example, create data in Excel
• save as .csv file
• Values of attributes
• Temperature, Humidity: numeric
• Outlook: Sunny, Cloudy, Rainy
• Windy, Play?: Yes, No
outlook  temp  humidity  windy  play?
sunny    85    85        no     no
sunny    80    90        yes    no
cloudy   83    86        no     yes
rainy    70    96        no     yes
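A minimal sketch of loading such a file in R (assuming the file was saved as weather.csv):

# read the .csv created in Excel; character columns become nominal attributes (factors)
weather <- read.csv("weather.csv", stringsAsFactors = TRUE)
str(weather)       # check each attribute's type: factors vs numeric
summary(weather)   # quick overview of the attribute values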
Problems Preparing the Input
• Data may come from different sources
• e.g. different departments within a company
• Variation in record keeping
• Style
• Data aggregation (hourly, weekly, monthly etc.)
• Synonyms
• Errors
• Data must be assembled, integrated, aggregated and
cleaned
Preparing data
• Wrangling: transforming data into another format to make it more
suitable and valuable for a task
• Cleansing (cleaning): detecting and correcting errors in the data
• Scraping: automatic extraction of data from a data source
• Integration: combining data from several disparate sources into a
(useful) dataset
Missing data
• Missing data may be unknown, unrecorded, irrelevant
• Causes
• Equipment faults
• Difficult to acquire (e.g. age, income)
• Measurement is not possible
• The fact that a value is missing may be informative
• e.g. missing test in medical examination
• BUT this is NOT usually the case
• Represented in R as NA
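A short sketch of how missing values behave in R (the vector name is hypothetical):

age <- c(25, NA, 47, 31)   # NA marks an unknown or unrecorded value
is.na(age)                 # FALSE TRUE FALSE FALSE: detect missing entries
mean(age)                  # NA: most calculations propagate missing values
mean(age, na.rm = TRUE)    # 34.33: explicitly ignore them instead
na.omit(age)               # drop the missing entries altogether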
Inaccurate Values
• Errors and omissions which do not affect original purpose of data collection
• e.g. age of bank customers not important
• e.g. customer IDs not important
• Typographical errors in nominal attributes
• e.g. Pepsi vs Pepsi-cola
• Deliberate errors
• E.g. People may lie about their mental health history
• Duplicates
• ML algorithms very sensitive to this
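A minimal sketch of detecting exact duplicates in R (hypothetical data):

drinks <- c("Pepsi", "Pepsi-cola", "Pepsi", "Coke")
duplicated(drinks)   # FALSE FALSE TRUE FALSE: only exact repeats are flagged
unique(drinks)       # "Pepsi" "Pepsi-cola" "Coke"
# note: "Pepsi" vs "Pepsi-cola" is NOT caught; typographical variants
# in nominal attributes need manual cleaning or fuzzy matching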
Summary
• Preparing data for input is difficult and demanding
• data may need assembling, integrating, aggregating and cleaning
• if data set is huge, a sample may be used
• need relation, attributes and data (or instances)
• Various types of data may be used
• Nominal and numeric are most common
Data Mining: Output
• Output: requirements
• Types of output
• Tables
• Rules
• Trees
• Instances
• Clusters
• Summary
Understanding the Output
• Output must be easy to understand
• representation of output is key
• Representation is NOT independent of the learning process
• learning algorithm used determines representation of output
• depends on type of algorithm
Representations
• Decision tables
• Classification rules
• Association rules
• Decision trees
• Regression Trees for numeric prediction
• Instance-based representation
• Clusters
Data Mining: Output
• Output: requirements
• Types of output
• Tables
• Rules
• Trees
• Instances
• Clusters
• Summary
Decision tables
• Uses the same format as the input but only uses selected attributes
• Challenge: choosing which attributes to select
• Decisions
• if outlook = sunny
and humidity = high
then play = no
• etc
outlook  humidity  play
sunny    high      no
sunny    normal    yes
cloudy   high      yes
…
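A minimal sketch of a decision table in R, assuming a weather data frame with nominal humidity values as in the table above:

# keep only the selected attributes plus the class
dtable <- unique(weather[, c("outlook", "humidity", "play")])
# classify a new instance by looking up its attribute values
new <- data.frame(outlook = "sunny", humidity = "high")
merge(new, dtable)   # returns the matching row(s), including play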
Classification Rules
• IF conditions THEN conclusion
• if outlook = sunny
and humidity > 83
then play = no
• Conditions
• tests that have to be true for the rule to apply
• usually several conditions connected by ANDs
• Conclusion
• solution to the problem
• class, set of classes or probability distribution
Classification Rules: problems
• Rules may contradict each other
• Two applicable rules may give different classifications
• Decision List is an ordered set of rules
• first satisfied rule should be applied
• rule only applied if preceding ones are not applicable
• no contradictions in classification
• Rules may fail to classify an instance
• Decision List may have a final classification rule with no
conditions
• no classification failures
If outlook=sunny & humidity=high then play=no
If outlook=rainy & windy=true then play=no
...
If none of the above then play=yes
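A minimal sketch of a decision list in R (names are illustrative): rules are kept in order, and the first rule whose conditions are satisfied supplies the classification.

# each rule: a condition (a function of the instance) and a conclusion
rules <- list(
  list(cond = function(x) x$outlook == "sunny" && x$humidity == "high", class = "no"),
  list(cond = function(x) x$outlook == "rainy" && x$windy == "true", class = "no"),
  list(cond = function(x) TRUE, class = "yes")   # final rule with no conditions
)
classify <- function(x) {
  for (r in rules) if (r$cond(x)) return(r$class)   # first satisfied rule applies
}
classify(list(outlook = "sunny", humidity = "high", windy = "false"))   # "no"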
Association Rules
• Like classification rules BUT
• used to infer the value of any attribute (not just class)
• or a combination of attributes
• NOT intended to be used together as a set
• different association rules determine different things
• Problem: many different association rules can be derived from a small
dataset
• Restrict to associations with
• high support: number of instances it predicts correctly
• high confidence (accuracy): proportion of instances it predicts correctly out of all
instances it applies to
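As a hedged sketch, association rules with minimum support and confidence can be mined in R with the arules package (a tooling assumption; the slides do not name a package). basket_data here is a hypothetical list of purchases:

library(arules)                            # assumes install.packages("arules")
basket_data <- list(c("beer", "nappy"),
                    c("beer", "nappy", "crisps"),
                    c("bread", "crisps"))
trans <- as(basket_data, "transactions")   # one row per shopping basket
rules <- apriori(trans, parameter = list(supp = 0.3, conf = 0.8))
inspect(sort(rules, by = "confidence"))    # keep only high support and confidence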
Association Rules
• Association rules
• If beer = yes and crisps = no then nappy = yes
• If beer = yes then nappy = yes and bread = no
Association Rules: Examples
• The rule
• If windy = false and play = no
then outlook = sunny and humidity = high
is different from the pair of rules
• If windy = false and play = no then outlook = sunny
• If windy = false and play = no then humidity = high
because their coverage and accuracy differ
Data Mining: Output
• Output: requirements
• Types of output
• Tables
• Rules
• Trees
• Instances
• Clusters
• Summary
Decision Trees
• Nodes represent attributes
• Each branch from a node usually represents a single value for that attribute
• but it can compare two values for the attribute
• use a function of one or more attributes
• Each leaf node contains answer to problem
• class
• set of classes or probability distribution
• To solve a problem
• new instance is routed down the tree to find solution
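A minimal sketch of learning and using such a tree in R with the rpart package (a tooling assumption; weather is the data frame from earlier):

library(rpart)
# grow a classification tree: nodes test attributes, leaves hold classes
tree <- rpart(play ~ outlook + temp + humidity + windy,
              data = weather, method = "class")
print(tree)
# route a new instance down the tree to find its class
predict(tree, newdata = weather[1, ], type = "class")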
Decision Nodes
• Nominal attribute
• number of branches out of a node is equal to number of attribute
values
• attribute tested at most once on a path
• Numeric attribute
• attribute value is compared (> or <) to a constant.
• or three way split may be used (i.e. 3 branches)
• <, =, > for integers
• below, within, above for real (test against an interval)
• attribute may be tested several times on a path
Decision Tree Example

1st year inc
  <= 2.5: bad
  > 2.5: Statutory holidays
    > 10: good
    <= 10: 1st year inc
      <= 4: bad
      > 4: good

[The original figure also shows a second tree testing Hours/week (<= 36 vs > 36) and Health plan (none / half / full), with good/bad leaves.]
Converting Trees to Rules
• Decision trees can be converted into a set of n rules
• n is the number of leaf nodes
• One rule for each path from the root to a leaf
• Conditions: one per node from root to leaf
• Conclusion(s): class(es) assigned by the leaf
• Rules obtained from a decision tree are unambiguous and complete
• no classification contradictions
• rules are order-independent
• no classification failures
• BUT rules may be unnecessarily complex
• rule pruning required to remove redundant conditions
Trees to Rules: Example
• if 1st year inc <= 2.5
then bad
• if 1st year inc > 2.5
and stat. holidays > 10
then good
• if 1st year inc > 2.5
and stat. holidays <= 10
and 1st year inc <= 4
then bad
• if 1st year inc > 2.5
and stat. holidays <= 10
and 1st year inc > 4
then good

[Figure: the decision tree from the previous slide, with one rule per root-to-leaf path.]
Rules to Trees: Example
• If a and b then x
• If c and d then x

[Figure: the equivalent decision tree, with yes/no branches on a, b, c and d and leaves x; the subtree testing c and d must be replicated under both branches of a (the replicated subtree problem).]
Trees for Numeric Prediction
• Predicting a numeric value not a class
• Regression computes an expression which
calculates a numeric value
• PRP = -55.9 + 0.0489 cycle time + 0.0153 min memory
+ 0.0056 max memory + 0.641 cache
- 0.27 min channels + 1.48 max channels
• Regression tree
• decision tree where each leaf predicts a numeric value
• value is average of training instances that reach leaf
• Model tree
• regression tree with linear regression model at leaf
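A hedged sketch in R, again assuming the rpart package: method = "anova" grows a regression tree, where each leaf predicts the average of the training instances that reach it. cpu is a hypothetical data frame of computer characteristics with numeric target PRP:

library(rpart)
rtree <- rpart(PRP ~ ., data = cpu, method = "anova")   # regression tree
predict(rtree, newdata = cpu[1, ])                      # numeric prediction, not a class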
Data Mining: Output
• Output: requirements
• Types of output
• Tables
• Trees
• Rules
• Instances
• Clusters
• Summary
Instance-Based Representation
• Simplest form of learning
• Look for instance most similar to new instance
• Lazy Learning
• work is done when problem-solving, not training
• Distance function
• Numeric calculation indicates similarity between two attribute values
• Numeric attributes: difference in values
• Nominal attributes
• 0 if equal, 1 if not
• Or more sophisticated measure (e.g. hue for colours)
[Figure: representing clusters: a table of per-instance cluster membership probabilities and a dendrogram over the instances g a c i e d k b j f h]

instance  1    2    3
a         0.4  0.1  0.5
b         0.1  0.8  0.1
c         0.3  0.3  0.4
d         0.1  0.1  0.8
e         0.4  0.2  0.4
…
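A minimal sketch of the instance-based idea in R (data values are hypothetical): all the work happens at problem-solving time, when the distance function finds the most similar stored instance.

# stored training instances (two numeric attributes) and their classes
train <- rbind(c(2.1, 0.5), c(4.0, 1.2), c(3.3, 0.9))
classes <- c("no", "yes", "yes")
nearest <- function(x) {
  # distance function: Euclidean distance to every stored instance
  d <- sqrt(rowSums((train - matrix(x, nrow(train), 2, byrow = TRUE))^2))
  classes[which.min(d)]   # answer with the class of the most similar instance
}
nearest(c(3.9, 1.0))   # "yes": closest to the second stored instance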
Summary
• There are lots of different ways of representing the output from a
data mining tool
• Trees – decision trees, regression trees
• Rules – classification, association
• Instances – decision table, nearest neighbour
• Clusters
• Output depends on learning algorithm and input