Coder Name: Rebecca Oquendo
Coding Categories:
Episode
Aggressive Behavior
Neutral Behavior
Virtuous Behavior
Aggressive Gaming
Neutral Gaming
Virtuous Gaming
An older peer began using slurs or derogatory language
An older peer suggested that the team should cheat
The child witnessed an older peer intentionally leave out
another player
An older player suggested that they play a different game

The child lost the game with older players on their team
The child witnessed an older player curse every time a mistake
was made
Index:
· In this case, aggressive behavior means mimicking older
members' undesired behaviors or becoming especially angry or
agitated in game. Neutral behavior means playing as the child
usually would, without mimicking older players' behaviors or
trying to fit in with their more aggressive styles. Virtuous
behavior means steering the game away from aggression, voicing
an opinion about the excessive aggression, or finding a way to
express their gaming experience in a positive way. The same
definitions apply to the corresponding "gaming" categories.
· Each category is rated on a scale from 1 to 7 according to how
strongly the child's dialogue tended in that direction, in both
behavior and gaming, with 1 indicating little to no effort in
that direction and 7 indicating extreme effort in that category.
1. What are the different types of attributes? Provide examples
of each attribute.
2. Describe the components of a decision tree. Give an example
problem and provide an example of each component in your
decision making tree
3. Conduct research over the Internet and find an article on data
mining. The article has to be less than 5 years old. Summarize
the article in your own words. Make sure that you use APA
formatting for this assignment.
Questions from attached files
1. Obtain one of the data sets available at the UCI Machine
Learning Repository and apply as many of the different
visualization techniques described in the chapter as possible.
The bibliographic notes and book Web site provide pointers to
visualization software.
2. Identify at least two advantages and two disadvantages of
using color to visually represent information.
3. What are the arrangement issues that arise with respect
to three-dimensional plots?
4. Discuss the advantages and disadvantages of using sampling
to reduce the number of data objects that need to be displayed.
Would simple random sampling (without replacement) be a
good approach to sampling? Why or why not?
5. Describe how you would create visualizations to display
information that describes the following types of systems.
a) Computer networks. Be sure to include both the static aspects

of the network, such as connectivity, and the dynamic aspects,
such as traffic.
b) The distribution of specific plant and animal species around
the world for a specific moment in time.
c) The use of computer resources, such as processor time, main
memory, and disk, for a set of benchmark database programs.
d) The change in occupation of workers in a particular country
over the last thirty years. Assume that you have yearly
information about each person that also includes gender and
level of education.
Be sure to address the following issues:
· Representation. How will you map objects, attributes, and
relationships to visual elements?
· Arrangement. Are there any special considerations that need to
be taken into account with respect to how visual elements are
displayed? Specific examples might be the choice of viewpoint,
the use of transparency, or the separation of certain groups of
objects.
· Selection. How will you handle a large number of attributes
and data objects?
6. Describe one advantage and one disadvantage of a stem and
leaf plot with respect to a standard histogram.
7. How might you address the problem that a histogram depends
on the number and location of the bins?
8. Describe how a box plot can give information about whether
the value of an attribute is symmetrically distributed. What can
you say about the symmetry of the distributions of the attributes
shown in Figure 3.11?
9. Compare sepal length, sepal width, petal length, and petal
width, using Figure 3.12.

10. Comment on the use of a box plot to explore a data set with
four attributes: age, weight, height, and income.
11. Give a possible explanation as to why most of the values of
petal length and width fall in the buckets along the diagonal in
Figure 3.9.
12. Use Figures 3.14 and 3.15 to identify a characteristic shared
by the petal width and petal length attributes.
13. Simple line plots, such as that displayed in Figure 2.12 on
page 56, which shows two time series, can be used to
effectively display high-dimensional data. For example, in
Figure 2.12 it is easy to tell that the frequencies of the two time
series are different. What characteristic of time series allows
the effective visualization of high-dimensional data?
14. Describe the types of situations that produce sparse or dense
data cubes. Illustrate with examples other than those used in the
book.
15. How might you extend the notion of multidimensional data
analysis so that the target variable is a qualitative variable? In
other words, what sorts of summary statistics or data
visualizations would be of interest?
16. Construct a data cube from Table 3.14. Is this a dense or

sparse data cube? If it is sparse, identify the cells that are
empty.
17. Discuss the differences between dimensionality reduction
based on aggregation and dimensionality reduction based on
techniques such as PCA and SVD.
01/27/2020 Introduction to Data Mining, 2nd Edition
Tan, Steinbach, Karpatne, Kumar
Data Mining: Data
Lecture Notes for Chapter 2
Introduction to Data Mining , 2nd Edition
by
Tan, Steinbach, Kumar
Outline

What is Data?
Collection of data objects and their attributes
An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as variable, field, characteristic,
dimension, or feature
A collection of attributes describe an object
– Object is also known as record, point, case, sample,
entity, or instance
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
(Columns are attributes; rows are objects.)

A More Complete View of Data
other attributes (objects)
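Attribute Values
Attribute values are numbers or symbols
assigned to an attribute for a particular object
Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute
values
– Different attributes can be mapped to the same set of
values
– But properties of attribute values can be different
Measurement of Length
The way you measure an attribute may not match the
attribute's properties.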
[Figure: line segments A through E measured on two scales. One scale
preserves only the ordering property of length; the other preserves
both the ordering and additivity properties of length.]
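Types of Attributes
There are different types of attributes
– Nominal
Examples: ID numbers, eye color, zip codes
– Ordinal
Examples: rankings (e.g., on a scale from 1-10), grades,
height {tall, medium, short}
– Interval
Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio
Examples: temperature in Kelvin, length, counts,
elapsed time (e.g., time to run a race)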
Properties of Attribute Values
The type of an attribute depends on which of the
following properties/operations it possesses:
– Distinctness: = ≠
– Order: < >
– Differences are meaningful: + -
– Ratios are meaningful: * /
– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & meaningful
differences
– Ratio attribute: all 4 properties/operations

Difference Between Ratio and Interval
Is it physically meaningful to say that a
temperature of 10° is twice that of 5° on
– the Celsius scale?
– the Fahrenheit scale?
– the Kelvin scale?
– If Bill’s height is three inches above average and
Bob’s height is six inches above average, then would
we say that Bob is twice as tall as Bill?
– Is this situation analogous to that of temperature?
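The temperature question above can be checked numerically. The following is a minimal sketch, assuming the usual Celsius-to-Kelvin offset of 273.15; the helper name `celsius_to_kelvin` is illustrative and not from the slides. The 2:1 ratio seen on the Celsius scale does not survive the change to Kelvin, which is one way to see that Celsius is an interval scale rather than a ratio scale.

```python
def celsius_to_kelvin(c):
    # Kelvin has a true zero, so ratios of Kelvin values are meaningful.
    return c + 273.15

# Ratio of the two readings on the Celsius scale.
ratio_celsius = 10 / 5

# Ratio of the same two physical temperatures on the Kelvin scale.
ratio_kelvin = celsius_to_kelvin(10) / celsius_to_kelvin(5)

print(ratio_celsius)           # ratio of the Celsius numbers
print(round(ratio_kelvin, 4))  # ratio of the Kelvin numbers, close to 1
```

The same reasoning applies to the height example: "inches above average" has an arbitrary zero, so ratios of those values say nothing about ratios of the heights themselves.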
Attribute Type: Description; Examples; Operations

Categorical (Qualitative)
Nominal
Description: Nominal attribute values only distinguish. (=, ≠)
Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
Operations: mode, entropy, contingency correlation, chi-square test
Ordinal
Description: Ordinal attribute values also order objects. (<, >)
Examples: hardness of minerals, {good, better, best}, grades, street numbers
Operations: median, percentiles, rank correlation, run tests, sign tests

Numeric (Quantitative)
Interval
Description: For interval attributes, differences between values are meaningful. (+, -)
Examples: calendar dates, temperature in Celsius or Fahrenheit
Operations: mean, standard deviation, Pearson's correlation, t and F tests
Ratio
Description: For ratio variables, both differences and ratios are meaningful. (*, /)
Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, current
Operations: geometric mean, harmonic mean, percent variation
This categorization of attributes is due to S. S. Stevens
Attribute Type: Transformation; Comments

Categorical (Qualitative)
Nominal
Transformation: Any permutation of values
Comments: If all employee ID numbers were reassigned, would it make any difference?
Ordinal
Transformation: An order preserving change of values, i.e.,
new_value = f(old_value) where f is a monotonic function
Comments: An attribute encompassing the notion of good, better, best can be
represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Numeric (Quantitative)
Interval
Transformation: new_value = a * old_value + b where a and b are constants
Comments: Thus, the Fahrenheit and Celsius temperature scales differ in terms
of where their zero value is and the size of a unit (degree).
Ratio
Transformation: new_value = a * old_value
Comments: Length can be measured in meters or feet.
This categorization of attributes is due to S. S. Stevens
Discrete and Continuous Attributes
Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as
floating-point variables.
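Asymmetric Attributes
Only presence (a non-zero attribute value) is regarded as
important
If we met a friend in the grocery store, would we ever say the
following?
“I see our purchases are very similar since we didn’t buy most
of the same things.”
We need two asymmetric binary attributes to represent one
ordinary binary attribute
– Association analysis uses asymmetric attributes
Asymmetric attributes typically arise from objects that are
sets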
Some Extensions and Critiques
Velleman, Paul F., and Leland Wilkinson. "Nominal,
ordinal, interval, and ratio typologies are misleading." The
American Statistician 47, no. 1 (1993): 65-72.
Mosteller, Frederick, and John W. Tukey. "Data analysis
and regression. A second course in statistics." Addison-
Wesley Series in Behavioral Science: Quantitative
Methods, Reading, Mass.: Addison-Wesley, 1977.
Chrisman, Nicholas R. "Rethinking levels of measurement
for cartography." Cartography and Geographic Information
Systems 25, no. 4 (1998): 231-242.

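Critiques
Incomplete
– Asymmetric binary
– Cyclical
– Multivariate
– Partially ordered
– Partial membership
– Relationships between the data
Real data is approximate and noisy
– This can complicate recognition of the proper attribute type
– Treating one attribute type as another may be approximately
correct
Critiques …
Not a good guide for statistical analysis
– May unnecessarily restrict operations and results
– Statistical analysis is often approximate, so treating one
attribute type as another (for example, ordinal as interval)
may be justified
– Transformations are common but don’t preserve
scales
– Can transform data to a new scale with better statistical
properties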
More Complicated Examples
– Nominal, ordinal, or interval?
– Nominal, ordinal, or ratio?
– Interval or Ratio
Key Messages for Attribute Types
The types of operations you choose should be
“meaningful” for the type of data you have
– Distinctness, order, meaningful intervals, and meaningful
ratios are only four properties of data
– The data type you see – often numbers or strings – may not
capture all the properties or may suggest properties that are not
present
– Analysis may depend on these other properties of the data
– Many times what is meaningful is measured by statistical
significance
– But in the end, what is meaningful is measured by the domain
Types of data sets
Record
– Data Matrix
– Document Data
– Transaction Data
Graph
– World Wide Web
– Molecular Structures
Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Important Characteristics of Data
– Dimensionality (number of attributes)
– Sparsity
– Resolution
– Size
Record Data
Data that consists of a collection of records, each
of which consists of a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
3    No      Single          70K             No
6    No      Married         60K             No
8    No      Single          85K             Yes
9    No      Married         75K             No
Data Matrix
If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute
Such a data set can be represented by an m by n matrix,
where there are m rows, one for each object, and n
columns, one for each attribute
Projection of x Load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1

Document Data
– Each term is a component (attribute) of the vector
– The value of each component is the number of times
the corresponding term occurs in the document.
Transaction Data
– Each transaction involves a set of items.
– For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitute a
transaction, while the individual products that were purchased
are the items.
– Can represent transaction data as record data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk

Graph Data
Benzene Molecule: C6H6
Ordered Data
An element of
the sequence

Items/Events
Ordered Data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
-Temporal Data
Average Monthly
Temperature of
land and ocean

Data Quality
efforts
“The most important point is that poor data quality is an
unfolding
disaster.
– Poor data quality costs the typical company at least ten
percent (10%) of revenue; twenty percent (20%) is
probably a better estimate.”
Thomas C. Redman, DM Review, August 2004
people who are loan risks is built using poor data
– Some credit-worthy candidates are denied loans
– More loans are given to individuals that default
Data Quality …

can we do about these problems?
– Noise and outliers
– Missing values
– Duplicate data
– Wrong data
– Fake data
Noise
– Examples: distortion of a person’s voice when talking on a
poor
phone and “snow” on television screen
[Figure: Two Sine Waves; Two Sine Waves + Noise]
Outliers
Outliers are data objects with characteristics that
are considerably different than most of the other
data objects in the data set
– Case 1: Outliers are noise that interferes
with data analysis
– Case 2: Outliers are the goal of our analysis
Missing Values
Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Handling missing values
– Eliminate data objects or variables
– Estimate missing values
– Ignore the missing value during analysis
Missing Values …
Missing completely at random (MCAR)
– Missingness of a value is independent of attributes
– Fill in values based on the attribute
– Analysis may be unbiased overall
Missing at random (MAR)
– Missingness is related to other variables
– Fill in values based on other values
– Almost always produces a bias in the analysis
Missing not at random (MNAR)
– Missingness is related to unobserved measurements
– Informative or non-ignorable missingness
In general, it is not possible to determine
the situation from the data
Duplicate Data
Data set may include data objects that are
duplicates, or almost duplicates of one another
– Major issue when merging data from heterogeneous
sources
Examples:
– Same person with multiple email addresses
Data cleaning
– Process of dealing with duplicate data issues
Similarity and Dissimilarity Measures
Similarity measure
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
Dissimilarity measure
– Numerical measure of how different two data objects
are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Similarity/Dissimilarity for Simple Attributes
The following table shows the similarity and dissimilarity
between two objects, x and y, with respect to a single, simple
attribute.
Euclidean Distance
d(x, y) = sqrt( Σ_{k=1}^{n} (x_k - y_k)^2 )
where n is the number of dimensions (attributes) and
x_k and y_k are, respectively, the kth attributes
(components) of data objects x and y.
Euclidean Distance
[Figure: scatter plot of the points p1, p2, p3, and p4]
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Distance Matrix
      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
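The distance matrix above can be reproduced with a few lines of code. This is a small sketch using only the four points from the table; the names `points`, `euclidean`, and `distance_matrix` are illustrative choices, not part of the slides.

```python
import math

# The four example points from the table.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(x, y):
    # Square root of the sum of squared component differences.
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

names = list(points)
distance_matrix = [[euclidean(points[a], points[b]) for b in names] for a in names]

for name, row in zip(names, distance_matrix):
    print(name, [round(d, 3) for d in row])
```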
Minkowski Distance
Minkowski Distance is a generalization of Euclidean
Distance
d(x, y) = ( Σ_{k=1}^{n} |x_k - y_k|^r )^(1/r)
where r is a parameter, n is the number of dimensions
(attributes) and x_k and y_k are, respectively, the kth
attributes (components) of data objects x and y.
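Minkowski Distance: Examples
r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this for binary vectors is the
Hamming distance, which is just the number of bits that are
different between two binary vectors
r = 2. Euclidean distance
r → ∞. "Supremum" (Lmax norm, L∞ norm) distance.
– This is the maximum difference between any component of
the vectors
Do not confuse r with n, i.e., all these distances are
defined for all numbers of dimensions.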
Minkowski Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Distance Matrix
L1    p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0
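The three matrices differ only in the parameter r. As a rough sketch (the function name `minkowski` and the convention of passing `math.inf` for the supremum distance are assumptions made for this example), the L1 and L∞ matrices can be computed as follows; passing r = 2 gives the Euclidean values.

```python
import math

points = [(0, 2), (2, 0), (3, 1), (5, 1)]  # p1, p2, p3, p4

def minkowski(x, y, r):
    """Minkowski distance; r = math.inf gives the supremum (L-infinity) distance."""
    diffs = [abs(xk - yk) for xk, yk in zip(x, y)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

l1 = [[minkowski(a, b, 1) for b in points] for a in points]
l2 = [[minkowski(a, b, 2) for b in points] for a in points]
lmax = [[minkowski(a, b, math.inf) for b in points] for a in points]

for label, matrix in (("L1", l1), ("L2", l2), ("Lmax", lmax)):
    print(label)
    for row in matrix:
        print([round(d, 3) for d in row])
```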
Mahalanobis Distance
mahalanobis(x, y) = (x - y)^T Σ^(-1) (x - y), where Σ is the covariance matrix
For red points, the Euclidean distance is 14.7, Mahalanobis
distance is 6.
Mahalanobis Distance

Covariance Matrix:
Σ = | 0.3  0.2 |
    | 0.2  0.3 |
A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)
Mahal(A,B) = 5
Mahal(A,C) = 4
[Figure: scatter plot with the points A, B, and C marked]
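The two values in this example can be checked by hand or in code. The sketch below assumes the covariance matrix is [[0.3, 0.2], [0.2, 0.3]] (the extracted digits are ambiguous, but this reading reproduces the stated results) and evaluates the quadratic form (x - y)^T Σ^(-1) (x - y) without taking a square root, since that is the form that yields 5 and 4. The function name `mahalanobis_sq` is hypothetical.

```python
def mahalanobis_sq(x, y, cov):
    """Quadratic form (x - y)^T inverse(cov) (x - y) for 2-D points."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = x[0] - y[0], x[1] - y[1]
    return dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)

cov = [[0.3, 0.2], [0.2, 0.3]]  # assumed reading of the covariance matrix
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)

mahal_ab = mahalanobis_sq(A, B, cov)
mahal_ac = mahalanobis_sq(A, C, cov)
print(round(mahal_ab, 6), round(mahal_ac, 6))
```

Although C is farther from A than B is in Euclidean terms, it lies along the direction of greatest variance, so its Mahalanobis value is smaller.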

Common Properties of a Distance
Distances, such as the Euclidean distance,
have some well known properties.
1. d(x, y) ≥ 0 for all x and y and d(x, y) = 0 if and only if
x = y. (Positive definiteness)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
(Triangle Inequality)
where d(x, y) is the distance (dissimilarity) between
points (data objects), x and y.
A distance that satisfies these properties is a
metric
Common Properties of a Similarity
Similarities also have some well known
properties.
1. s(x, y) = 1 (or maximum similarity) only if x = y.
(does not always hold, e.g., cosine)
2. s(x, y) = s(y, x) for all x and y. (Symmetry)
where s(x, y) is the similarity between points (data
objects), x and y.
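Similarity Between Binary Vectors
A common situation is that objects, x and y, have only
binary attributes
Compute similarities using the following quantities
f01 = the number of attributes where x was 0 and y was 1
f10 = the number of attributes where x was 1 and y was 0
f00 = the number of attributes where x was 0 and y was 0
f11 = the number of attributes where x was 1 and y was 1
Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)
J = number of 11 matches / number of non-zero attributes
= (f11) / (f01 + f10 + f11)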

SMC versus Jaccard: Example
x = 1 0 0 0 0 0 0 0 0 0
y = 0 0 0 0 0 0 1 0 0 1
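f01 = 2 (the number of attributes where x was 0 and y was 1)
f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)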
SMC = (f11 + f00) / (f01 + f10 + f11 + f00)
= (0+7) / (2+1+0+7) = 0.7
J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
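The SMC and Jaccard values for these two vectors can be verified with a short script. This is a minimal sketch; the helper `match_counts` and the choice to return 0.0 for Jaccard when both vectors are all zeros are assumptions made for the example.

```python
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

def match_counts(x, y):
    # Count each of the four (x, y) value combinations.
    f = {"f00": 0, "f01": 0, "f10": 0, "f11": 0}
    for xi, yi in zip(x, y):
        f[f"f{xi}{yi}"] += 1
    return f

f = match_counts(x, y)
smc = (f["f11"] + f["f00"]) / len(x)

# Jaccard ignores 0-0 matches; guard against an all-zero pair (assumed convention).
nonzero = f["f01"] + f["f10"] + f["f11"]
jaccard = f["f11"] / nonzero if nonzero else 0.0

print(f, smc, jaccard)
```

The large gap between the two results comes entirely from the seven 0-0 matches, which SMC counts and Jaccard ignores.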
Cosine Similarity
If d1 and d2 are two document vectors, then
cos( d1, d2 ) = <d1,d2> / ( ||d1|| ||d2|| ),
where <d1,d2> indicates inner product or vector dot
product of vectors, d1 and d2, and || d || is the length of
vector d.
Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 +
0*0 + 0*2 = 5
|| d1 || = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5
= (42)^0.5 = 6.481
|| d2 || = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5
= (6)^0.5 = 2.449
cos(d1, d2 ) = 0.3150
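The arithmetic in this example can be reproduced directly. A small sketch follows; the function name `cosine_similarity` is an illustrative choice.

```python
import math

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

similarity = cosine_similarity(d1, d2)
print(round(similarity, 4))
```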
Extended Jaccard Coefficient (Tanimoto)
Variation of Jaccard for continuous or count
attributes
– Reduces to Jaccard for binary attributes
EJ(x, y) = (x · y) / ( ||x||^2 + ||y||^2 - x · y )

Correlation measures the linear relationship
between objects
Visually Evaluating Correlation
Scatter plots
showing the
similarity from
–1 to 1.
Drawback of Correlation
x = (-3, -2, -1, 0, 1, 2, 3)
y = (9, 4, 1, 0, 1, 4, 9)
yi = xi^2
mean(x) = 0, mean(y) = 4
std(x) = 2.16, std(y) = 3.74
corr = ( (-3)(5)+(-2)(0)+(-1)(-3)+(0)(-4)+(1)(-3)+(2)(0)+(3)(5) ) /
( 6 * 2.16 * 3.74 )
= 0
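This example shows that a perfect but non-linear relationship can have zero correlation. The sketch below recomputes it with the sample (n - 1) formulas, which match the divisor of 6 used in the slide; the function name `pearson` is illustrative.

```python
import math

x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]  # y is completely determined by x

def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Sample covariance and sample standard deviations (divide by n - 1).
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)
    std_a = math.sqrt(sum((ai - mean_a) ** 2 for ai in a) / (n - 1))
    std_b = math.sqrt(sum((bi - mean_b) ** 2 for bi in b) / (n - 1))
    return cov / (std_a * std_b)

corr = pearson(x, y)
print(corr)
```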
Comparison of Proximity Measures
– Similarity measures tend to be specific to the type of
attribute and data
– Record data, images, graphs, sequences, 3D-protein
structure, etc. tend to have different measures
However, one can talk about various properties that
you would like a proximity measure to have
– Symmetry is a common one
– Tolerance to noise and outliers is another
– Ability to find more types of patterns?
– Many others possible
The measure must be applicable to the data and
produce results that agree with domain knowledge

Information Based Measures
Information theory is a well-developed and
fundamental discipline with broad applications
Some similarity measures are based on
information theory
– Mutual information in various versions
– Maximal Information Coefficient (MIC) and related
measures
– General and can handle non-linear relationships
– Can be complicated and time intensive to compute
Information and Probability
Information relates to possible outcomes of an event
– transmission of a message, flip of a coin, or measurement
of a piece of data
The more certain an outcome, the less information
that it contains and vice-versa
– For example, if a coin has two heads, then an outcome of
heads provides no information
– More quantitatively, the information is related to the
probability of an outcome
The smaller the probability of an outcome, the more
information it provides and vice-versa
– Entropy is the commonly used measure
Entropy
For
– a variable (event), X,
– with n possible values (outcomes), x1, x2 …, xn
– each outcome having probability, p1, p2 …, pn
– the entropy of X, H(X), is given by
H(X) = -Σ_{i=1}^{n} p_i log2 p_i
Entropy is between 0 and log2 n and is measured in
bits
– Thus, entropy is a measure of how many bits it takes to
represent an observation of X on average

Entropy Examples
For a coin with probability p of heads and
probability q = 1 – p of tails
H = -p log2 p - q log2 q
– For p = 0.5, q = 0.5 (fair coin) H = 1
– For p = 1 or q = 1, H = 0
-sided die?
Entropy for Sample Data: Example
Maximum entropy is log2 5 = 2.3219
Hair Color  Count  p     -p log2 p
Black       75     0.75  0.3113
Brown       15     0.15  0.4105
Blond       5      0.05  0.2161
Red         0      0.00  0
Other       5      0.05  0.2161
Total       100    1.0   1.1540
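The table can be recomputed from the raw counts. This is a minimal sketch; the function name `entropy` and the convention of skipping zero counts (treating 0 log 0 as 0, as the table does for Red) are choices made for the example.

```python
import math

def entropy(counts):
    """Entropy in bits of the empirical distribution given by category counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:  # treat 0 * log2(0) as 0
            p = c / total
            h -= p * math.log2(p)
    return h

hair_counts = [75, 15, 5, 0, 5]  # Black, Brown, Blond, Red, Other
hair_entropy = entropy(hair_counts)
max_entropy = math.log2(len(hair_counts))

print(round(hair_entropy, 4), round(max_entropy, 4))
```

The same function gives 1 bit for a fair coin, `entropy([1, 1])`, and 0 bits for a certain outcome, `entropy([1, 0])`.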
Entropy for Sample Data
Suppose we have
– a number of observations (m) of some attribute, X,
e.g., the hair color of students in the class,
– where there are n different possible values
– And the number of observations in the ith category is m_i
– Then, for this sample
H(X) = -Σ_{i=1}^{n} (m_i / m) log2 (m_i / m)

Mutual Information
Formally, I(X, Y) = H(X) + H(Y) - H(X, Y), where
H(X, Y) is the joint entropy of X and Y,
H(X, Y) = -Σ_i Σ_j p_ij log2 p_ij
where p_ij is the probability that the ith value of X and the
jth value of Y occur together
For discrete variables, the maximum mutual information is
log2(min( nX, nY )), where nX (nY) is the number of values of
X (Y)
X (Y)
Mutual Information Example
Student Status  Count  p     -p log2 p
Undergrad       45     0.45  0.5184
Grad            55     0.55  0.4744
Total           100    1.00  0.9928

Grade  Count  p     -p log2 p
A      35     0.35  0.5301
B      50     0.50  0.5000
C      15     0.15  0.4105
Total  100    1.00  1.4406

Student Status  Grade  Count  p     -p log2 p
Undergrad       A      5      0.05  0.2161
Undergrad       B      30     0.30  0.5211
Undergrad       C      10     0.10  0.3322
Grad            A      30     0.30  0.5211
Grad            B      20     0.20  0.4644
Grad            C      5      0.05  0.2161
Total                  100    1.00  2.2710
Mutual information of Student Status and Grade = 0.9928 +
1.4406 - 2.2710 = 0.1624
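The example can be recomputed from the joint counts alone, since the marginal counts follow by summing. The sketch below is one way to do it; the names `joint_counts` and `entropy` are illustrative. Because it works at full precision rather than summing the rounded table entries, the last printed digit of the mutual information may differ slightly from the 0.1624 shown above.

```python
import math

def entropy(counts):
    # Entropy in bits; zero counts contribute nothing.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Joint counts of (student status, grade) from the table.
joint_counts = {
    ("Undergrad", "A"): 5, ("Undergrad", "B"): 30, ("Undergrad", "C"): 10,
    ("Grad", "A"): 30, ("Grad", "B"): 20, ("Grad", "C"): 5,
}

# Marginal counts are obtained by summing the joint counts.
status_counts, grade_counts = {}, {}
for (status, grade), count in joint_counts.items():
    status_counts[status] = status_counts.get(status, 0) + count
    grade_counts[grade] = grade_counts.get(grade, 0) + count

h_status = entropy(list(status_counts.values()))
h_grade = entropy(list(grade_counts.values()))
h_joint = entropy(list(joint_counts.values()))
mutual_information = h_status + h_grade - h_joint

print(round(h_status, 4), round(h_grade, 4), round(h_joint, 4))
print(round(mutual_information, 3))
```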

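Maximal Information Coefficient
Reshef, David N., Yakir A. Reshef, Hilary K. Finucane,
Sharon R. Grossman, Gilean McVean, Peter
J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, and
Pardis C. Sabeti. "Detecting novel
associations in large data sets." Science 334, no. 6062 (2011):
1518-1524.
Applies mutual information to two continuous
variables
Consider the possible binnings of the variables into
discrete categories
– nX × nY ≤ N^0.6 where
nX is the number of values of X
nY is the number of values of Y
N is the number of samples (observations, data objects)
Compute the maximum mutual information
– Normalized by log2(min( nX, nY ))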

General Approach for Combining Similarities
Sometimes attributes are of many different types, but an
overall similarity is needed.
1: For the kth attribute, compute a similarity, sk(x, y), in the
range [0, 1].
2: Define an indicator variable, δk, for the kth attribute as
follows:
δk = 0 if the kth attribute is an asymmetric attribute and
both objects have a value of 0, or if one of the objects
has a missing value for the kth attribute
δk = 1 otherwise
3. Compute
similarity(x, y) = Σ_{k=1}^{n} δk sk(x, y) / Σ_{k=1}^{n} δk
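Using Weights to Combine Similarities
May not want to treat all attributes the same.
– Use non-negative weights ωk
– similarity(x, y) = Σ_{k=1}^{n} ωk δk sk(x, y) / Σ_{k=1}^{n} ωk δk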
Density
Measures the degree to which data objects are close to
each other in a specified area
Concept of density is typically used for clustering and
anomaly detection
Examples:
– Euclidean density
– Probability density
– Graph-based density
Euclidean Density: Grid-based Approach
Simplest approach is to divide region into a
number of rectangular cells of equal volume and
define density as # of points the cell contains
[Figure: Grid-based density. Counts for each cell.]
Euclidean Density: Center-Based
Euclidean density is the number of points within a
specified radius of the point
[Figure: Illustration of center-based density.]
Data Preprocessing
Feature subset selection
Aggregation
Combining two or more attributes (or objects) into
a single attribute (or object)
Purpose
– Data reduction
– Change of scale
– More “stable” data
– Aggregated data tends to have less variability
Example: Precipitation in Australia
This example is based on precipitation in
Australia from the period 1982 to 1993.
The next slide shows
– A histogram for the standard deviation of average
monthly precipitation for 3,030 0.5° by 0.5° grid cells in
Australia, and
– A histogram for the standard deviation of the average
yearly precipitation for the same locations.
The average yearly precipitation has less
variability than the average monthly precipitation.
All precipitation measurements (and their
standard deviations) are in centimeters.
Example: Precipitation in Australia …
Variation of Precipitation in Australia
[Figure: histograms of the Standard Deviation of Average Monthly
Precipitation and the Standard Deviation of Average Yearly Precipitation]
Sampling
Sampling is the main technique employed for data
reduction.
– It is often used for both the preliminary investigation of
the data and the final data analysis.
Statisticians often sample because obtaining the
entire set of data of interest is too expensive or
time consuming.
Sampling is typically used in data mining because
processing the entire set of data of interest is too
expensive or time consuming.
Sampling …
The key principle for effective sampling is the
following:
– Using a sample will work almost as well as using the
entire data set, if the sample is representative
– A sample is representative if it has approximately the
same properties (of interest) as the original set of data
Sample Size
[Figure: 8000 points; 2000 Points; 500 Points]
Types of Sampling
Simple Random Sampling
– There is an equal probability of selecting any particular
item
– Sampling without replacement
As each item is selected, it is removed from the
population
– Sampling with replacement
Objects are not removed from the population as they
are selected for the sample.
In sampling with replacement, the same object can
be picked up more than once
Stratified sampling
– Split the data into several partitions; then draw random
samples from each partition
Sample Size
What sample size is necessary to get at least one
object from each of 10 equal-sized groups?
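One way to explore this question is to compute the probability that a random sample covers all 10 groups. The sketch below assumes each draw is equally likely to come from any group, independently of the others (that is, sampling with replacement, or groups large enough that removal does not matter); under that assumption the probability follows from inclusion-exclusion. The function name `prob_all_groups` is hypothetical.

```python
from math import comb

def prob_all_groups(sample_size, groups=10):
    """Probability that every group appears at least once in the sample.

    Assumes each draw lands in any of the equal-sized groups with equal
    probability, independently of the other draws. Uses inclusion-exclusion
    over the number j of groups that are missed.
    """
    return sum(
        (-1) ** j * comb(groups, j) * ((groups - j) / groups) ** sample_size
        for j in range(groups + 1)
    )

for n in (10, 20, 40, 60):
    print(n, round(prob_all_groups(n), 4))
```

With exactly 10 draws the probability is tiny (every draw must hit a different group), and it rises toward 1 only once the sample is several times larger than the number of groups.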
Curse of Dimensionality
When dimensionality increases, data becomes
increasingly sparse in the space that it occupies
Definitions of density and distance between points,
which are critical for clustering and outlier
detection, become less meaningful
[Figure: • Randomly generate 500 points
• Compute difference between max and
min distance between any pair of points]
Dimensionality Reduction
Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce
noise
Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques
Dimensionality Reduction: PCA
Goal is to find a projection that captures the
largest amount of variation in data
[Figure: data in the (x1, x2) plane with the projection direction e]
Dimensionality Reduction: PCA
Feature Subset Selection
Another way to reduce dimensionality of data
Redundant features
– Duplicate much or all of the information contained in
one or more other attributes
– Example: purchase price of a product and the amount
of sales tax paid
Irrelevant features
– Contain no information that is useful for the data
mining task at hand
– Example: students' ID is often irrelevant to the task of
predicting students' GPA
Many techniques developed, especially for
classification
Feature Creation
Create new attributes that can capture the
important information in a data set much more
efficiently than the original attributes
Three general methodologies:
– Feature extraction
– Feature construction
– Mapping data to new space
Mapping Data to a New Space
[Figure: Two Sine Waves + Noise; Frequency]
Discretization
Discretization is the process of converting a
continuous attribute into an ordinal attribute
– A potentially infinite number of values are mapped
into a small number of categories
– Discretization is commonly used in classification
– Many classification algorithms work best if both
the independent and dependent variables have
only a few values
– We give an illustration of the usefulness of
discretization using the Iris data set
Iris Sample Data Set
Iris Plant data set
– Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician R. A. Fisher
– Three flower types (classes): Setosa, Versicolour, Virginica
– Four (non-class) attributes: Sepal width and length,
Petal width and length
Virginica. Robert H. Mohlenbrock. USDA
NRCS. 1995. Northeast wetland flora: Field
office guide to plant species. Northeast National
Technical Center, Chester, PA. Courtesy of
USDA NRCS Wetland Science Institute.

Discretization: Iris Example
Petal width low or petal length low implies Setosa.
Petal width medium or petal length medium implies
Versicolour.
Petal width high or petal length high implies Virginica.
Discretization: Iris Example …
– Unsupervised discretization: find breaks in the data
values
– Supervised discretization: Use class labels to find
breaks
[Figure: histogram of Petal Length (x-axis, 0 to 8) against Counts
(y-axis, 0 to 50)]
Discretization Without Using Class Labels
Data consists of four groups of points and two outliers. Data is
one-dimensional, but a random y component is added to reduce
overlap.
Equal interval width approach used to obtain 4 values.
Equal frequency approach used to obtain 4 values.
K-means approach to obtain 4 values.
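The first two approaches are simple enough to sketch in a few lines. The code below is an illustration on a small made-up list of values (not the data in the figures); the function names and the handling of ties and of the maximum value are assumptions. It shows how a single outlier affects equal-width bins far more than equal-frequency bins.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Clamp so the maximum value lands in the last bin rather than bin k.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

values = [1, 2, 3, 4, 5, 6, 7, 100]  # hypothetical data with one outlier

width_bins = equal_width_bins(values, 4)
frequency_bins = equal_frequency_bins(values, 4)
print(width_bins)
print(frequency_bins)
```

With equal widths the outlier stretches the range so that all other values share one bin, while equal frequency spreads the values evenly across the four bins.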
Binarization
Binarization maps a continuous or categorical
attribute into one or more binary variables
Often convert a continuous attribute to a
categorical attribute and then convert a
categorical attribute to a set of binary attributes
– Association analysis needs asymmetric binary
attributes
– Examples: eye color and height measured as
{low, medium, high}
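Converting a categorical attribute into a set of binary attributes can be sketched as follows. This is one common encoding (one binary attribute per category); the function name `one_hot` and the sample eye colors are made up for the illustration.

```python
def one_hot(values):
    """Map each categorical value to a vector with one binary attribute per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

eye_colors = ["brown", "blue", "green", "brown"]  # hypothetical sample
categories, encoded = one_hot(eye_colors)

print(categories)
for color, row in zip(eye_colors, encoded):
    print(color, row)
```

Each resulting attribute is asymmetric in the sense used above: a 1 marks the presence of a category, and each object has exactly one 1.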
Attribute Transformation
An attribute transform is a function that maps the
entire set of values of a given attribute to a new
set of replacement values such that each old
value can be identified with one of the new values
– Simple functions: x^k, log(x), e^x, |x|
– Normalization
Refers to various techniques to adjust to
differences among attributes in terms of frequency
of occurrence, mean, variance, range
Take out unwanted, common signal, e.g.,
seasonality
– In statistics, standardization refers to subtracting off
the means and dividing by the standard deviation
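Standardization as described in the last bullet can be sketched with the standard library. The sample values are made up, the function name `standardize` is illustrative, and the use of the sample (n - 1) standard deviation is an assumption; a population standard deviation would work the same way.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / std for v in values]

z = standardize([2.0, 4.0, 6.0, 8.0])  # hypothetical attribute values
print([round(v, 4) for v in z])
```

After the transformation the values have mean 0 and standard deviation 1, so attributes measured on different scales become directly comparable.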

Example: Sample Time Series of Plant Growth
[Figure: time series for Minneapolis]
Net Primary Production (NPP) is a measure of plant growth used
by ecosystem scientists.
Correlations between time series
             Minneapolis  Atlanta  Sao Paulo
Minneapolis  1.0000       0.7591   -0.7581
Atlanta      0.7591       1.0000   -0.5739
Sao Paulo    -0.7581      -0.5739  1.0000
Seasonality Accounts for Much Correlation
[Figure: normalized time series for Minneapolis]
Normalized using monthly Z Score: Subtract off monthly
mean and divide by monthly standard deviation
Correlations between time series
             Minneapolis  Atlanta  Sao Paulo
Minneapolis  1.0000       0.0492   0.0906
Atlanta      0.0492       1.0000   -0.0154
Sao Paulo    0.0906       -0.0154  1.0000
Data Mining: Exploring Data
Introduction to Data Mining
by
Tan, Steinbach, Kumar
(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR
2002

What is data exploration?
A preliminary exploration of the data to better understand its
characteristics.
Key motivations of data exploration include
– Helping to select the right tool for preprocessing or analysis
– Making use of humans’ abilities to recognize patterns
People can recognize patterns not captured by data analysis
tools
Related to the area of Exploratory Data Analysis (EDA)
– Created by statistician John Tukey
– Seminal book is Exploratory Data Analysis by Tukey
– A nice online introduction can be found in
Chapter 1 of the NIST Engineering Statistics Handbook
http://www.itl.nist.gov/div898/handbook/index.htm
Techniques Used In Data Exploration
In EDA, as originally defined by Tukey
– The focus was on visualization
– Clustering and anomaly detection were viewed as exploratory
techniques
– In data mining, clustering and anomaly detection are major
areas of interest, and not thought of as just exploratory
In our discussion of data exploration, we focus on
– Summary statistics
– Visualization
– Online Analytical Processing (OLAP)
Iris Sample Data Set
Many of the exploratory data techniques
are illustrated with the Iris Plant data set.
– Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician R. A. Fisher
– Three flower types (classes): Setosa, Virginica, Versicolour
– Four (non-class) attributes: Sepal width and length,
Petal width and length
Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995.
Northeast wetland flora: Field office guide to plant species.
Northeast National Technical Center, Chester, PA. Courtesy of
USDA NRCS Wetland Science Institute.
Summary Statistics
Summary statistics are numbers that
summarize properties of the data
– Summarized properties include frequency, location and spread
Examples: location - mean
spread - standard deviation
– Most summary statistics can be calculated in a single pass
through the data
Frequency and ModeThe frequency of an attribute value is the
percentage of time the value occurs in the
data set For example, given the attribute ‘gender’ and a
representative population of people, the gender ‘female’ occurs
about 50% of the time.The mode of a an attribute is the most
frequent attribute value The notions of frequency and mode are
typically used with categorical data
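A minimal Python sketch of frequency and mode for a categorical attribute. The `gender` list is made-up illustration data, and the helper names `frequencies` and `mode` are hypothetical.

```python
from collections import Counter

def frequencies(values):
    """Return {value: fraction of records having that value}."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

def mode(values):
    """Return the most frequent value (ties go to the value seen first)."""
    return Counter(values).most_common(1)[0][0]

# Hypothetical categorical attribute, made up for illustration.
gender = ["female", "male", "female", "male", "female", "male", "female", "female"]
freq = frequencies(gender)
most_common_value = mode(gender)
```

Both quantities need only one pass over the data to build the counts.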
Percentiles
· For continuous data, the notion of a percentile is more useful.
· Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile is a value $x_p$ of x such that p% of the observed values of x are less than $x_p$.
· For instance, the 50th percentile is the value $x_{50\%}$ such that 50% of all values of x are less than $x_{50\%}$.
Measures of Location: Mean and Median
· The mean is the most common measure of the location of a set of points.
· However, the mean is very sensitive to outliers.
· Thus, the median or a trimmed mean is also commonly used.
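The sensitivity to outliers can be seen in a small Python sketch. The data values and the `trimmed_mean` helper are assumptions for illustration; the trimming rule used here (drop the same fraction of points from each end) is one common convention.

```python
import statistics

def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the points from each end of the sorted data."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of points dropped per end
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return statistics.mean(kept)

data = [1, 2, 3, 4, 5, 90]  # illustrative values; 90 is an outlier
mean_value = statistics.mean(data)        # pulled upward by the outlier
median_value = statistics.median(data)    # unaffected by the outlier
trimmed_value = trimmed_mean(data, 0.2)   # drops 1 and 90 before averaging
```

With these values the mean is far above most of the points, while the median and the trimmed mean stay near the bulk of the data.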
Measures of Spread: Range and Variance
· Range is the difference between the max and min
· The variance or standard deviation is the most common measure of the spread of a set of points.
· However, this is also sensitive to outliers, so that other measures are often used.
Visualization
Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.
· Visualization of data is one of the most powerful and appealing techniques for data exploration.
– Humans have a well developed ability to analyze large amounts of information that is presented visually
– Can detect general patterns and trends
– Can detect outliers and unusual patterns
Example: Sea Surface Temperature
· The following shows the Sea Surface Temperature (SST) for July 1982
– Tens of thousands of data points are summarized in a single figure
Representation
· Is the mapping of information to a visual format
· Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.
· Example:
– Objects are often represented as points
– Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape
– If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, are easily perceived.
Arrangement
· Is the placement of visual elements within a display
· Can make a large difference in how easy it is to understand the data
· Example: [figure]
Selection
· Is the elimination or the de-emphasis of certain objects and attributes
· Selection may involve choosing a subset of attributes
– Dimensionality reduction is often used to reduce the number of dimensions to two or three
– Alternatively, pairs of attributes can be considered
· Selection may also involve choosing a subset of objects
– A region of the screen can only show so many points
– Can sample, but want to preserve points in sparse areas
Visualization Techniques: Histograms
· Histogram
– Usually shows the distribution of values of a single variable
– Divide the values into bins and show a bar plot of the number of objects in each bin.
– The height of each bar indicates the number of objects
– Shape of histogram depends on the number of bins
· Example: Petal Width (10 and 20 bins, respectively)
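The dependence on the number of bins can be sketched in Python as below. The `histogram` helper and the `widths` values are hypothetical (they are not the actual Iris measurements), and equal-width bins are an assumption.

```python
def histogram(values, num_bins):
    """Count how many values fall into each of `num_bins` equal-width bins.

    Assumes the values are not all identical (otherwise the bin width is zero).
    """
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin rather than one of its own.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return counts

# Made-up petal-width-like values in centimetres.
widths = [0.2, 0.2, 0.4, 0.3, 1.3, 1.5, 1.4, 1.8, 2.5, 2.1, 2.3, 1.0]
coarse = histogram(widths, 2)
fine = histogram(widths, 4)
```

The same data looks evenly split with two bins, while four bins reveal that the smallest values form their own group.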

Two-Dimensional Histograms
· Show the joint distribution of the values of two attributes
· Example: petal width and petal length
– What does this tell us?
Visualization Techniques: Box Plots
· Box Plots
– Invented by J. Tukey
– Another way of displaying the distribution of data
– Following figure shows the basic part of a box plot
[Figure: box plot labeled, from top to bottom: outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, 10th percentile]
Example of Box Plots
· Box plots can be used to compare attributes
Visualization Techniques: Scatter Plots
· Scatter plots
– Attribute values determine the position
– Two-dimensional scatter plots most common, but can have three-dimensional scatter plots
– Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
– It is useful to have arrays of scatter plots, which can compactly summarize the relationships of several pairs of attributes
– See example on the next slide
Scatter Plot Array of Iris Attributes
Visualization Techniques: Contour Plots
· Contour plots
– Useful when a continuous attribute is measured on a spatial grid
– They partition the plane into regions of similar values
– The contour lines that form the boundaries of these regions connect points with equal values
– The most common example is contour maps of elevation
– Can also display temperature, rainfall, air pressure, etc.
– An example for Sea Surface Temperature (SST) is provided on the next slide
Contour Plot Example: SST Dec, 1998 (temperature in Celsius)
Visualization Techniques: Matrix Plots
· Matrix plots
– Can plot the data matrix
– This can be useful when objects are sorted according to class
– Typically, the attributes are normalized to prevent one attribute from dominating the plot
– Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects
– Examples of matrix plots are presented on the next two slides
Visualization of the Iris Data Matrix (color scale in units of standard deviation)
Visualization of the Iris Correlation Matrix
Visualization Techniques: Parallel Coordinates
· Parallel Coordinates
– Used to plot the attribute values of high-dimensional data
– Instead of using perpendicular axes, use a set of parallel axes
– The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line
– Thus, each object is represented as a line
– Often, the lines representing a distinct class of objects group together, at least for some attributes
– Ordering of attributes is important in seeing such groupings
Parallel Coordinates Plots for Iris Data
Other Visualization Techniques
· Star Plots
– Similar approach to parallel coordinates, but axes radiate from a central point
– The line connecting the values of an object is a polygon
· Chernoff Faces
– Approach created by Herman Chernoff
– This approach associates each attribute with a characteristic of a face
– The values of each attribute determine the appearance of the corresponding facial characteristic
– Each object becomes a separate face
– Relies on humans’ ability to distinguish faces
Star Plots for Iris Data
[Figure: star plots grouped by class: Setosa, Versicolour, Virginica]
Chernoff Faces for Iris Data
[Figure: Chernoff faces grouped by class: Setosa, Versicolour, Virginica]
OLAP
· On-Line Analytical Processing (OLAP) was proposed by E. F. Codd, the father of the relational database.
· Relational databases put data into tables, while OLAP uses a multidimensional array representation.
– Such representations of data previously existed in statistics and other fields
· There are a number of data analysis and data exploration operations that are easier with such a data representation.
Creating a Multidimensional Array
· Two key steps in converting tabular data into a multidimensional array.
– First, identify which attributes are to be the dimensions and which attribute is to be the target attribute whose values appear as entries in the multidimensional array.
  · The attributes used as dimensions must have discrete values
  · The target value is typically a count or continuous value, e.g., the cost of an item
  · Can have no target variable at all except the count of objects that have the same set of attribute values
– Second, find the value of each entry in the multidimensional array by summing the values (of the target attribute) or count of all objects that have the attribute values corresponding to that entry.
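The two steps can be sketched in Python with a sparse dictionary standing in for the array. The records below are hypothetical discretized flowers, not the actual Iris table, and `cell` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical discretized records: (petal_length, petal_width, species).
records = [
    ("low", "low", "Setosa"),
    ("low", "low", "Setosa"),
    ("medium", "medium", "Versicolour"),
    ("high", "medium", "Virginica"),
    ("high", "high", "Virginica"),
    ("high", "high", "Virginica"),
]

# Step 1: the three attributes are the dimensions; there is no separate
# target attribute, so each entry is a count of matching objects.
# Step 2: count the objects that share each tuple of dimension values.
cube = Counter(records)

def cell(cube, petal_length, petal_width, species):
    """Look up one entry; tuples that never occur have count 0."""
    return cube[(petal_length, petal_width, species)]
```

Storing only the non-zero cells keeps the representation small when most combinations of values never occur.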
Example: Iris data
· We show how the attributes, petal length, petal width, and species type can be converted to a multidimensional array
– First, we discretized the petal width and length to have categorical values: low, medium, and high
– We get the following table - note the count attribute
Example: Iris data (continued)
· Each unique tuple of petal width, petal length, and species type identifies one element of the array.
· This element is assigned the corresponding count value.
· The figure illustrates the result.
· All non-specified tuples are 0.
Example: Iris data (continued)
· Slices of the multidimensional array are shown by the following cross-tabulations
· What do these tables tell us?
OLAP Operations: Data Cube
· The key operation of OLAP is the formation of a data cube
· A data cube is a multidimensional representation of data, together with all possible aggregates.
· By all possible aggregates, we mean the aggregates that result by selecting a proper subset of the dimensions and summing over all remaining dimensions.
· For example, if we choose the species type dimension of the Iris data and sum over all other dimensions, the result will be a one-dimensional array with three entries, each of which gives the number of flowers of each type.
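A Python sketch of "all possible aggregates" for a three-dimensional array of counts. The cell values are hypothetical (chosen so that each species totals 50), and `aggregates` is an illustrative helper, not part of any OLAP product.

```python
from collections import Counter
from itertools import combinations

def aggregates(cells, num_dims):
    """Sum a {dimension-tuple: value} array over every proper subset of kept dimensions."""
    result = {}
    for size in range(num_dims):  # proper subsets: keep 0 .. num_dims - 1 dimensions
        for kept in combinations(range(num_dims), size):
            totals = Counter()
            for key, value in cells.items():
                totals[tuple(key[d] for d in kept)] += value
            result[kept] = dict(totals)
    return result

# Hypothetical counts indexed by (petal_length, petal_width, species).
cells = {
    ("low", "low", "Setosa"): 46,
    ("low", "medium", "Setosa"): 4,
    ("medium", "medium", "Versicolour"): 43,
    ("medium", "high", "Versicolour"): 7,
    ("high", "high", "Virginica"): 50,
}
all_aggregates = aggregates(cells, 3)
by_species = all_aggregates[(2,)]    # keep only the species dimension
grand_total = all_aggregates[()][()]  # sum over every dimension
```

For three dimensions this yields 3 two-dimensional, 3 one-dimensional, and 1 zero-dimensional aggregate, seven in total.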
Data Cube Example
· Consider a data set that records the sales of products at a number of company stores at various dates.
· This data can be represented as a 3 dimensional array
· There are 3 two-dimensional aggregates (3 choose 2), 3 one-dimensional aggregates, and 1 zero-dimensional aggregate (the overall total)
Data Cube Example (continued)
· The following table shows one of the two dimensional aggregates, along with two of the one-dimensional aggregates, and the overall total
OLAP Operations: Slicing and Dicing
· Slicing is selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions.
· Dicing involves selecting a subset of cells by specifying a range of attribute values.
– This is equivalent to defining a subarray from the complete array.
· In practice, both operations can also be accompanied by aggregation over some dimensions.
OLAP Operations: Roll-up and Drill-down
· Attribute values often have a hierarchical structure.
– Each date is associated with a year, month, and week.
– A location is associated with a continent, country, state (province, etc.), and city.
– Products can be divided into various categories, such as clothing, electronics, and furniture.
· Note that these categories often nest and form a tree or lattice
– A year contains months, which contain days
– A country contains states, which contain cities
OLAP Operations: Roll-up and Drill-down
· This hierarchical structure gives rise to the roll-up and drill-down operations.
– For sales data, we can aggregate (roll up) the sales across all the dates in a month.
– Conversely, given a view of the data where the time dimension is broken into months, we could split the monthly sales totals (drill down) into daily sales totals.
– Likewise, we can drill down or roll up on the location or product ID attributes.

Data Mining
Classification: Basic Concepts and
Techniques
Introduction to Data Mining, 2nd Edition
by
02/03/2020 Introduction to Data Mining, 2nd Edition 1
Classification: Definition
· Given a collection of records (training set)
– Each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label
– x is also called the predictor, independent variable, or input
· Task:
– Learn a model that maps each attribute set x into one of the predefined class labels y

Examples of Classification Task
Task | Attribute set, x | Class label, y
Categorizing email messages | Features extracted from email message header and content | spam or non-spam
Identifying tumor cells | x-rays or MRI scans | malignant or benign cells
Cataloging galaxies | telescope images | Elliptical, spiral, or irregular-shaped galaxies

General Approach for Building Classification Model
Training set (Learn Model):
Tid Attrib1 Attrib2 Attrib3 Class
1 Yes Large 125K No
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No
8 No Small 85K Yes
9 No Medium 75K No
10 No Small 90K Yes
Test set (Apply Model):
Tid Attrib1 Attrib2 Attrib3 Class
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
14 No Small 95K ?
15 No Large 67K ?
Classification Techniques
· Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest-neighbor
– Neural Networks
– Deep Learning
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines
· Ensemble Classifiers
– Boosting, Bagging, Random Forests
Example of a Decision Tree
Training Data:
ID Home Owner Marital Status Annual Income Defaulted Borrower
3 No Single 70K No
6 No Married 60K No
8 No Single 85K Yes
9 No Married 75K No
Model: Decision Tree (splitting attributes: Home Owner, MarSt, Income)
Home Owner?
  Yes: NO
  No: MarSt?
    Married: NO
    Single, Divorced: Income?
      < 80K: NO
      > 80K: YES
Another Example of Decision Tree
MarSt?
  Married: NO
  Single, Divorced: Home Owner?
    Yes: NO
    No: Income?
      < 80K: NO
      > 80K: YES
There could be more than one tree that fits the same data!
Apply Model to Test Data
Test Data:
Home Owner Marital Status Annual Income Defaulted Borrower
No Married 80K ?
Start from the root of tree.
· Home Owner = No: follow the No branch to the MarSt test
· MarSt = Married: follow the Married branch to the leaf NO
· Assign Defaulted to “No”
Decision Tree Classification Task
[Figure: the Learn Model / Apply Model workflow from the general approach, using the same training records (Tid 1 to 10) and test records (Tid 11 to 15), with a decision tree as the learned model]
Decision Tree Induction
· Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
General Structure of Hunt’s Algorithm
· Let Dt be the set of training records that reach a node t
· General Procedure:
– If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
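The general procedure can be sketched recursively in Python. This is a simplified illustration under stated assumptions: the attribute test is simply "the next attribute in the list" (a placeholder for a real selection measure), the majority-class fallback when attributes run out is an added safeguard, and the four records mirror the loan rows shown in the slides with income pre-binned by hand.

```python
from collections import Counter

def hunt(records, attributes):
    """Grow a tree from (attribute_dict, label) pairs following the general procedure."""
    labels = [label for _, label in records]
    # Case 1: all records belong to the same class, so the node is a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # No attribute left to test: fall back to the majority class (added safeguard).
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Case 2: more than one class, so split on an attribute test and recurse.
    attribute, remaining = attributes[0], attributes[1:]
    subsets = {}
    for features, label in records:
        subsets.setdefault(features[attribute], []).append((features, label))
    return {attribute: {value: hunt(subset, remaining) for value, subset in subsets.items()}}

data = [
    ({"owner": "No", "marital": "Single", "income": "<80K"}, "No"),
    ({"owner": "No", "marital": "Married", "income": "<80K"}, "No"),
    ({"owner": "No", "marital": "Single", "income": ">=80K"}, "Yes"),
    ({"owner": "No", "marital": "Married", "income": "<80K"}, "No"),
]
tree = hunt(data, ["owner", "marital", "income"])
```

The result is a nested dictionary in which each inner key is an attribute value and each leaf is a class label.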
Hunt’s Algorithm
[Figure: Hunt’s algorithm applied step by step to the loan data. Each node shows class counts (Defaulted = No, Defaulted = Yes): (7,3) at the root; (3,0) and (4,3) after the first split; (1,3) and (3,0) after the second split; (1,0) and (0,3) after the third split.]
Design Issues of Decision Tree Induction
· How should training records be split?
– Method for specifying test condition
– Measure for evaluating the goodness of a test condition
· How should the splitting procedure stop?
– Stop splitting if all the records belong to the same class or have identical attribute values
– Early termination

Methods for Expressing Test Conditions
· Depends on attribute types
– Binary
– Nominal
– Ordinal
– Continuous
· Depends on number of ways to split
– 2-way split
– Multi-way split
Test Condition for Nominal Attributes
· Multi-way split:
– Use as many partitions as distinct values.
· Binary split:
– Divides values into two subsets
Test Condition for Ordinal Attributes
· Multi-way split:
– Use as many partitions as distinct values
· Binary split:
– Divides values into two subsets
– Preserve order property among attribute values
[Figure annotation: This grouping violates order property]
Test Condition for Continuous Attributes
Splitting Based on Continuous Attributes
· Different ways of handling
– Discretization to form an ordinal categorical attribute. Ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
– discretize once at the beginning
– repeat at each node
– Binary decision: (A ≤ v) or (A > v); considering every candidate split value v can be more compute-intensive
How to determine the Best Split
Before Splitting: 10 records of class 0, 10 records of class 1
Which test condition is the best?
How to determine the Best Split
· Greedy approach:
– Nodes with purer class distribution are preferred
· Need a measure of node impurity:
[Figure: two example nodes, one with a high degree of impurity and one with a low degree of impurity]
Measures of Node Impurity
· Gini Index: $Gini(t) = 1 - \sum_{i=0}^{c-1} p_i(t)^2$
· Entropy: $Entropy(t) = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)$
· Misclassification error: $Error(t) = 1 - \max_i [p_i(t)]$
Where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes
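The three measures can be sketched in Python as functions of a node's class counts. The function names are illustrative, and the `node` counts (1 record of C1, 5 of C2) are taken from the single-node examples in these slides.

```python
import math

def gini(counts):
    """Gini index of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy (base 2) of a node; classes with zero count contribute 0."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def classification_error(counts):
    """Misclassification error of a node."""
    total = sum(counts)
    return 1.0 - max(counts) / total

node = [1, 5]  # class counts (C1, C2)
node_gini = gini(node)
node_entropy = entropy(node)
node_error = classification_error(node)
```

All three are 0 for a pure node and reach their maximum when the classes are equally represented.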
Finding the Best Split
1. Compute impurity measure (P) before splitting
2. Compute impurity measure (M) after splitting
· Compute impurity measure of each child node
· M is the weighted impurity of child nodes
3. Choose the attribute test condition that produces the highest gain
Gain = P - M
or equivalently, lowest impurity measure after splitting (M)
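Finding the Best Split
Before Splitting: class counts C0 = N00, C1 = N01, with impurity P
· Split on A? (Yes: Node N1 with C0 = N10, C1 = N11; No: Node N2 with C0 = N20, C1 = N21): child impurities M11 and M12, weighted impurity M1
· Split on B? (Yes: Node N3 with C0 = N30, C1 = N31; No: Node N4 with C0 = N40, C1 = N41): child impurities M21 and M22, weighted impurity M2
Gain = P – M1 vs P – M2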
Measure of Impurity: GINI
· Gini Index for a given node $t$: $Gini(t) = 1 - \sum_{i=0}^{c-1} p_i(t)^2$
Where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes
– Maximum of $1 - 1/c$ when records are equally distributed among all classes, implying the least beneficial situation for classification
– Minimum of 0 when all records belong to one class, implying the most beneficial situation for classification
Measure of Impurity: GINI
· Gini Index for a given node $t$: $Gini(t) = 1 - \sum_{i=0}^{c-1} p_i(t)^2$
– For 2-class problem (p, 1 – p): $Gini = 1 - p^2 - (1 - p)^2 = 2p(1 - p)$
C1 C2 Gini
0 6 0.000
1 5 0.278
2 4 0.444
3 3 0.500
Computing Gini Index of a Single Node
$Gini(t) = 1 - \sum_{i=0}^{c-1} p_i(t)^2$
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 – P(C1)^2 – P(C2)^2 = 1 – 0 – 1 = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Gini = 1 – (1/6)^2 – (5/6)^2 = 0.278
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Gini = 1 – (2/6)^2 – (4/6)^2 = 0.444
Computing Gini Index for a Collection of Nodes
· When a node $p$ is split into $k$ partitions (children): $Gini_{split} = \sum_{i=1}^{k} \frac{n_i}{n} Gini(i)$
where $n_i$ = number of records at child $i$, $n$ = number of records at parent node $p$.
· Choose the attribute that minimizes weighted average Gini index of the children
· Gini index is used in decision tree algorithms such as CART, SLIQ, SPRINT
Binary Attributes: Computing GINI Index
· Splits into two partitions (child nodes)
· Effect of weighing partitions:
– Larger and purer partitions are sought
Split on B? (Yes: Node N1, No: Node N2)
Parent: C1 = 7, C2 = 5, Gini = 0.486
   N1 N2
C1 5 2
C2 1 4
Gini(N1) = 1 – (5/6)^2 – (1/6)^2 = 0.278
Gini(N2) = 1 – (2/6)^2 – (4/6)^2 = 0.444
Weighted Gini of N1 and N2 = 6/12 * 0.278 + 6/12 * 0.444 = 0.361
Gain = 0.486 – 0.361 = 0.125
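The arithmetic of this example can be checked with a short Python sketch. The helper names are illustrative; the counts are the ones given for the parent, N1, and N2.

```python
def gini(counts):
    """Gini index of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children):
    """Weighted average Gini of child nodes, each given as a list of class counts."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

parent = [7, 5]               # (C1, C2) at the parent
children = [[5, 1], [2, 4]]   # N1 and N2 after splitting on B
gain = gini(parent) - weighted_gini(children)
```

Each child is weighted by its share of the parent's records, so a large pure partition lowers the weighted Gini more than a small one.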

Categorical Attributes: Computing Gini Index
· For each distinct value, gather counts for each class in the dataset
· Use the count matrix to make decisions
Multi-way split:
CarType Family Sports Luxury
C1 1 8 1
C2 3 0 7
Gini 0.163
Two-way split (find best partition of values):
CarType {Sports, Luxury} {Family}
C1 9 1
C2 7 3
Gini 0.468
CarType {Sports} {Family, Luxury}
C1 8 2
C2 0 10
Gini 0.167
Which of these is the best?
Continuous Attributes: Computing Gini Index
· Use Binary Decisions based on one value
· Several Choices for the splitting value
– Number of possible splitting values = Number of distinct values
· Each splitting value has a count matrix associated with it
– Class counts in each of the partitions, A ≤ v and A > v
· Simple method to choose best v
– For each v, scan the database to gather count matrix and compute its Gini index
– Computationally Inefficient! Repetition of work.
Count matrix for the test on Annual Income:
Annual Income ≤ 80 > 80
Defaulted Yes 0 3
Defaulted No 3 4
Continuous Attributes: Computing Gini Index...
· For efficient computation: for each attribute,
– Sort the attribute on values
– Linearly scan these values, each time updating the count matrix and computing gini index
– Choose the split position that has the least gini index
Sorted Values:
Annual Income 60 70 75 85 90 95 100 120 125 220
Cheat No No No Yes Yes Yes No No No No
Split Positions, with the count matrix and Gini index at each:
Split position | Yes (<=, >) | No (<=, >) | Gini
55 | 0, 3 | 0, 7 | 0.420
65 | 0, 3 | 1, 6 | 0.400
72 | 0, 3 | 2, 5 | 0.375
80 | 0, 3 | 3, 4 | 0.343
87 | 1, 2 | 3, 4 | 0.417
92 | 2, 1 | 3, 4 | 0.400
97 | 3, 0 | 3, 4 | 0.300
110 | 3, 0 | 4, 3 | 0.343
122 | 3, 0 | 5, 2 | 0.375
172 | 3, 0 | 6, 1 | 0.400
230 | 3, 0 | 7, 0 | 0.420
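The sort-then-scan idea can be sketched in Python as follows. The incomes and class labels are the ten training records listed earlier; the `best_split` helper is illustrative, and it uses exact midpoints between adjacent sorted values as candidates (an assumption, so it reports 97.5 where the table lists 97).

```python
def gini(counts):
    """Gini index of a partition; an empty partition contributes nothing."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def best_split(values, labels, positive="Yes"):
    """Sort once, then scan candidate split positions while updating class counts."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(1 for _, label in pairs if label == positive)
    left = [0, 0]                        # [positive, negative] counts with value <= v
    right = [total_pos, n - total_pos]   # counts with value > v
    best_gini, best_v = gini(right), None  # None means "no split"
    for i in range(n - 1):
        idx = 0 if pairs[i][1] == positive else 1
        left[idx] += 1     # move one record from the right partition to the left
        right[idx] -= 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # cannot split between equal values
        v = (pairs[i][0] + pairs[i + 1][0]) / 2
        score = (i + 1) / n * gini(left) + (n - i - 1) / n * gini(right)
        if score < best_gini:
            best_gini, best_v = score, v
    return best_v, best_gini

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
defaulted = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
split, score = best_split(income, defaulted)
```

Because the counts are updated incrementally, the scan after sorting touches each record once instead of rescanning the data for every candidate value.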
Measure of Impurity: Entropy
· Entropy at a given node $t$: $Entropy(t) = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)$
Where $p_i(t)$ is the frequency of class $i$ at node $t$, and $c$ is the total number of classes
– Maximum of $\log_2 c$ when records are equally distributed among all classes, implying the least beneficial situation for classification
– Minimum of 0 when all records belong to one class, implying most beneficial situation for classification
– Entropy based computations are quite similar to the GINI index computations

Computing Entropy of a Single Node
$Entropy(t) = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t)$
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92

Computing Information Gain After Splitting
· Information Gain: $Gain_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)$
Parent node $p$ is split into $k$ partitions (children); $n_i$ is number of records in child node $i$
– Choose the split that achieves most reduction (maximizes GAIN)
– Used in the ID3 and C4.5 decision tree algorithms
– Information gain is the mutual information between the class variable and the splitting variable
Problem with large number of partitions
· Node impurity measures tend to prefer splits that result in large number of partitions, each being small but pure
– Customer ID has highest information gain because entropy for all the children is zero

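Gain Ratio
· Gain Ratio: $Gain\ Ratio = \frac{Gain_{split}}{Split\ Info}$, where $Split\ Info = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$
– Adjusts Information Gain by the entropy of the partitioning ($Split\ Info$).
– Higher entropy partitioning (large number of small partitions) is penalized!
– Used in C4.5 algorithm
– Designed to overcome the disadvantage of Information Gain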
Gain Ratio
· Gain Ratio: $Gain\ Ratio = \frac{Gain_{split}}{Split\ Info}$, where $Split\ Info = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$
CarType Family Sports Luxury
C1 1 8 1
C2 3 0 7
Gini 0.163, SplitINFO = 1.52
CarType {Sports, Luxury} {Family}
C1 9 1
C2 7 3
Gini 0.468, SplitINFO = 0.72
CarType {Sports} {Family, Luxury}
C1 8 2
C2 0 10
Gini 0.167, SplitINFO = 0.97
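A Python sketch of the gain ratio for the CarType splits, assuming entropy-based information gain. The `gain_ratio` helper is illustrative; the class counts are those in the CarType tables.

```python
import math

def entropy(counts):
    """Entropy (base 2) of a list of counts; zero counts contribute 0."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(parent, children):
    """Return (gain ratio, split information) for one candidate split."""
    n = sum(parent)
    sizes = [sum(child) for child in children]
    gain = entropy(parent) - sum(s / n * entropy(child) for s, child in zip(sizes, children))
    split_info = entropy(sizes)  # entropy of the partition sizes
    return gain / split_info, split_info

parent = [10, 10]                      # (C1, C2) counts before splitting
multiway = [[1, 3], [8, 0], [1, 7]]    # Family, Sports, Luxury
two_way = [[8, 0], [2, 10]]            # {Sports}, {Family, Luxury}
ratio_multi, info_multi = gain_ratio(parent, multiway)
ratio_two, info_two = gain_ratio(parent, two_way)
```

The multi-way split has the larger split information, so its information gain is divided by a bigger number and the two-way split ends up with the higher gain ratio.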

Measure of Impurity: Classification Error
· Classification error at a node $t$: $Error(t) = 1 - \max_i [p_i(t)]$
– Maximum of $1 - 1/c$ when records are equally distributed among all classes, implying the least interesting situation
– Minimum of 0 when all records belong to one class, implying the most interesting situation
Computing Error of a Single Node
$Error(t) = 1 - \max_i [p_i(t)]$
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 – max (0, 1) = 1 – 1 = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3
Comparison among Impurity Measures
For a 2-class problem: [figure]
Misclassification Error vs Gini Index
Split on A? (Yes: Node N1, No: Node N2)
Parent: C1 = 7, C2 = 3, Gini = 0.42
   N1 N2
C1 3 4
C2 0 3
Gini = 0.342
Gini(N1) = 1 – (3/3)^2 – (0/3)^2 = 0
Gini(N2) = 1 – (4/7)^2 – (3/7)^2 = 0.489
Gini(Children) = 3/10 * 0 + 7/10 * 0.489 = 0.342
Gini improves but error remains the same!!
Misclassification Error vs Gini Index
Split on A? (Yes: Node N1, No: Node N2)
Parent: C1 = 7, C2 = 3, Gini = 0.42
First split:
   N1 N2
C1 3 4
C2 0 3
Gini = 0.342
Second split:
   N1 N2
C1 3 4
C2 1 2
Gini = 0.416
Misclassification error for all three cases = 0.3 !

Decision Tree Based Classification
· Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid overfitting are employed)
– Can easily handle redundant or irrelevant attributes (unless the attributes are interacting)
· Disadvantages:
– Space of possible decision trees is exponentially large. Greedy approaches are often unable to find the best tree.
– Does not take into account interactions between attributes
– Each decision boundary involves only a single attribute
Handling interactions
[Figure: scatter plot of X versus Y; + : 1000 instances, o : 1000 instances]
Entropy (X) : 0.99
Entropy (Y) : 0.99
Handling interactions
+ : 1000 instances
o : 1000 instances
Adding Z as a noisy attribute generated from a uniform distribution
[Figure: scatter plots over pairs of the attributes X, Y, and Z]
Entropy (X) : 0.99
Entropy (Y) : 0.99
Entropy (Z) : 0.98
Attribute Z will be chosen for splitting!
Limitations of single attribute-based decision boundaries
Both positive (+) and negative (o) classes generated from skewed Gaussians with centers at (8,8) and (12,12) respectively.
Data Mining
Model Overfitting
Introduction to Data Mining, 2nd Edition
by
Classification Errors
· Training errors (apparent errors)
– Errors committed on the training set
· Test errors
– Errors committed on the test set
· Generalization errors
– Expected error of a model over random selection of records from same distribution
Example Data Set
Two class problem:
· + : 5400 instances
– 5000 instances generated from a Gaussian centered at (10,10)
– 400 noisy instances added
· o : 5400 instances
– Generated from a uniform distribution
10% of the data used for training and 90% of the data used for testing
Increasing number of nodes in Decision Trees
Decision Tree with 4 nodes
[Figure: the decision tree and its decision boundaries on training data]
[Figure: a second, larger decision tree and its decision boundaries on training data]
Which tree is better?
Model Overfitting
· Underfitting: when model is too simple, both training and test errors are large
· Overfitting: when model is too complex, training error is small but test error is large
· As the model becomes more and more complex, test errors can start increasing even though training error may be decreasing
Model Overfitting
Using twice the number of data instances
· Increasing the size of training data reduces the difference between training and testing errors at a given size of model
[Figure: two panels, each titled Decision Tree with 50 nodes]
Reasons for Model Overfitting
· Limited Training Size
· High Model Complexity
– Multiple Comparison Procedure
Effect of Multiple Comparison Procedure
· Consider the task of predicting whether stock market will rise/fall in the next 10 trading days
· Random guessing: P(correct) = 0.5
· Make 10 random guesses in a row:
Day 1 Up
Day 2 Down
Day 3 Down
Day 4 Up
Day 5 Down
Day 6 Down
Day 7 Up
Day 8 Up
Day 9 Up
Day 10 Down
Probability of at least 8 correct guesses out of 10:
$P(\#\text{correct} \geq 8) = \frac{\binom{10}{8} + \binom{10}{9} + \binom{10}{10}}{2^{10}} = 0.0547$
· Approach:
– Get 50 analysts
– Each analyst makes 10 random guesses
– Choose the analyst that makes the most number of correct predictions
· Probability that at least one analyst makes at least 8 correct predictions: $1 - (1 - 0.0547)^{50} = 0.9399$
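Both probabilities can be checked numerically with a short Python sketch. The helper name `prob_at_least` is illustrative, and the analysts are assumed to guess independently.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of at least k successes in n independent trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

single = prob_at_least(8, 10)        # one analyst guessing at random
any_of_50 = 1 - (1 - single) ** 50   # at least one of 50 independent analysts
```

A single random guesser rarely gets 8 of 10 right, but among 50 such guessers it is very likely that someone does, which is why picking the best of many alternatives can look better than it is.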
· Many algorithms employ the following greedy strategy:
– Initial model: M
– Alternative model: M extended by an additional component (e.g., a test condition of a decision tree)
– Keep the alternative model if it improves on M
· If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting

Effect of Multiple Comparison - Example
[Figure panel] Use additional 100 noisy variables generated from a uniform distribution along with X and Y as attributes. Use 30% of the data for training and 70% of the data for testing
[Figure panel] Using only X and Y as attributes
Notes on Overfitting
· Overfitting results in decision trees that are more complex than necessary
· Training error does not provide a good estimate of how well the tree will perform on previously unseen records
· Need ways for estimating generalization errors
Model Selection
· Performed during model building
· Purpose is to ensure that model is not overly complex (to avoid overfitting)
· Need to estimate generalization error
– Using Validation Set
– Incorporating Model Complexity
– Estimating Statistical Bounds
Model Selection: Using Validation Set
· Divide training data into two parts:
– Training set: used for model building
– Validation set: used for estimating generalization error
· Drawback:
– Less data available for training

Model Selection: Incorporating Model Complexity
· Rationale: Occam’s Razor
– Given two models of similar generalization errors, one should prefer the simpler model over the more complex model
– A complex model has a greater chance of being fitted accidentally
– Therefore, one should include model complexity when evaluating a model
Gen. Error(Model) = Train. Error(Model, Train. Data) + α x Complexity(Model)
Estimating the Complexity of Decision Trees
· Pessimistic Error Estimate of decision tree T with k leaf nodes: $err_{gen}(T) = err(T) + \Omega \times \frac{k}{N_{train}}$
– err(T): error rate on all training records
– Ω: trade-off hyper-parameter (similar to α)
– k: number of leaf nodes
– Ntrain: total number of training records
Estimating the Complexity of Decision Trees: Example
e(TL) = 4/24
e(TR) = 6/24
egen(TL) = 4/24 + 1*7/24 = 11/24 = 0.458
egen(TR) = 6/24 + 1*4/24 = 10/24 = 0.417
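The two estimates can be reproduced with exact fractions in Python. This sketch assumes the estimate is the training error rate plus a penalty of Ω per leaf divided by the number of training records, with Ω = 1; the leaf counts 7 and 4 are read off the arithmetic shown above, and the helper name is illustrative.

```python
from fractions import Fraction

def pessimistic_error(train_errors, num_leaves, n_train, omega=1):
    """Training error rate plus omega * (number of leaves / number of training records)."""
    return Fraction(train_errors, n_train) + omega * Fraction(num_leaves, n_train)

e_left = pessimistic_error(4, 7, 24)    # tree TL: 4 training errors, 7 leaves
e_right = pessimistic_error(6, 4, 24)   # tree TR: 6 training errors, 4 leaves
```

Although TL makes fewer training errors, its larger number of leaves gives it the higher pessimistic estimate, so TR is preferred under this measure.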
Estimating the Complexity of Decision Trees
· Resubstitution Estimate:
– Using training error as an optimistic estimate of generalization error
– Referred to as optimistic error estimate
e(TL) = 4/24
e(TR) = 6/24

Minimum Description Length (MDL)
· Cost(Model,Data) = Cost(Data|Model) + α × Cost(Model)
– Cost is the number of bits needed for encoding.
– Search for the least costly model.
· Cost(Data|Model) encodes the misclassification errors.
· Cost(Model) uses node encoding (number of children)
plus splitting condition encoding.
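[Figure: a decision tree with test nodes A?, B?, C? (branches Yes/No, B1/B2, C1/C2; leaf labels 0 and 1) shown between two tables. A holds records X1 … Xn with known labels y; B holds the same records X1 … Xn with unknown labels (y = ?).]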
Estimating Statistical Bounds
Before splitting: e = 2/7, e’(7, 2/7, 0.25) = 0.503
After splitting:
e(TL) = 1/4, e’(4, 1/4, 0.25) = 0.537
e(TR) = 1/3, e’(3, 1/3, 0.25) = 0.650
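e’(T) = 4/7 × 0.537 + 3/7 × 0.650 = 0.585
where e’(N, e, α) is the upper bound on the error rate e observed on N records:
$$e'(N, e, \alpha) = \frac{e + \dfrac{z_{\alpha/2}^{2}}{2N} + z_{\alpha/2}\sqrt{\dfrac{e(1-e)}{N} + \dfrac{z_{\alpha/2}^{2}}{4N^{2}}}}{1 + \dfrac{z_{\alpha/2}^{2}}{N}}$$
Since 0.585 > 0.503, the bound is worse after splitting.
Therefore, do not split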
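
The three bounds quoted above can be reproduced numerically. The sketch below assumes the usual normal-approximation upper bound on an error rate, with z taken as the standard normal value whose upper-tail probability is α/2. The function name `error_upper_bound` is illustrative, and weighting the two leaves by 4/7 and 3/7 follows from their record counts.

```python
from math import sqrt
from statistics import NormalDist


def error_upper_bound(n, e, alpha):
    """Upper bound e'(N, e, alpha) on an error rate e observed on n records."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    numerator = e + z**2 / (2 * n) + z * sqrt(e * (1 - e) / n + z**2 / (4 * n**2))
    return numerator / (1 + z**2 / n)


before = error_upper_bound(7, 2 / 7, 0.25)
left = error_upper_bound(4, 1 / 4, 0.25)
right = error_upper_bound(3, 1 / 3, 0.25)
# Weight each leaf's bound by its share of the 7 records.
after = 4 / 7 * left + 3 / 7 * right
should_split = after < before

print(round(before, 3), round(left, 3), round(right, 3), round(after, 3))
print("split" if should_split else "do not split")
```

Smaller nodes get wider bounds, which is why the split looks worse here even though neither child has a much higher raw error rate than the parent.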
Model Selection for Decision Trees
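· Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
  · Stop if all instances belong to the same class
  · Stop if all the attribute values are the same
– More restrictive conditions:
  · Stop if the number of instances is less than some user-specified
threshold
  · Stop if the class distribution of instances is independent of the
available features (e.g., using a χ² test)
  · Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).
  · Stop if the estimated generalization error falls below a certain
threshold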
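
Early-stopping checks of this kind can be combined into a single predicate. The sketch below assumes three checks only: the node is pure, the node has fewer instances than a user-specified threshold, or the candidate split does not reduce Gini impurity. The names `should_stop`, `min_instances`, and `min_gain` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def should_stop(labels, child_label_groups, min_instances=5, min_gain=0.0):
    """Return True if the node should become a leaf instead of being split."""
    if len(set(labels)) <= 1:          # all instances share one class
        return True
    if len(labels) < min_instances:    # too few instances to split reliably
        return True
    n = len(labels)
    weighted = sum(len(g) / n * gini(g) for g in child_label_groups if g)
    return gini(labels) - weighted <= min_gain  # split does not reduce impurity


node = ["yes"] * 3 + ["no"] * 3
print(should_stop(node, [["yes"] * 3, ["no"] * 3]))  # a perfectly separating split
```

Raising `min_instances` or `min_gain` makes the rule more restrictive and yields smaller trees, at the risk of stopping before a useful split is found.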
Model Selection for Decision Trees
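· Post-pruning
– Grow decision tree to its entirety
– Subtree replacement
  · Trim the nodes of the decision tree in a bottom-up
fashion
  · If generalization error improves after trimming,
replace sub-tree by a leaf node
  · Class label of the leaf node is determined from the
majority class of instances in the sub-tree
– Subtree raising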
Example of Post-Pruning
Node before splitting on A?:
Class = Yes 20
Class = No 10
Error = 10/30
Training Error (Before splitting) = 10/30
Pessimistic error = (10 + 0.5)/30 = 10.5/30
Children after splitting on A? (branches A1 to A4):
A1: Class = Yes 8, Class = No 4
A2: Class = Yes 3, Class = No 4
A3: Class = Yes 4, Class = No 1
A4: Class = Yes 5, Class = No 1
Training Error (After splitting) = 9/30
Pessimistic error (After splitting) = (9 + 4 × 0.5)/30 = 11/30
PRUNE!
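
The pruning decision in this example can be recomputed from the class counts. The sketch assumes a penalty of 0.5 per leaf, as in the (10 + 0.5)/30 figure, and assumes the rule "prune when the pessimistic error does not improve after splitting". The function name and the dictionary layout are illustrative.

```python
from fractions import Fraction


def pessimistic_error(leaf_class_counts, penalty=Fraction(1, 2)):
    """(training errors + penalty per leaf) / total instances over a set of leaves."""
    total = sum(sum(counts.values()) for counts in leaf_class_counts)
    # Each leaf predicts its majority class; the remaining instances are errors.
    errors = sum(sum(c.values()) - max(c.values()) for c in leaf_class_counts)
    return (errors + penalty * len(leaf_class_counts)) / total


root = [{"Yes": 20, "No": 10}]
children = [
    {"Yes": 8, "No": 4},
    {"Yes": 3, "No": 4},
    {"Yes": 4, "No": 1},
    {"Yes": 5, "No": 1},
]

before = pessimistic_error(root)
after = pessimistic_error(children)
prune = after >= before
print(before, after, "PRUNE" if prune else "KEEP")
```

The split lowers the training error from 10/30 to 9/30, but the penalty for three extra leaves outweighs that gain, so the subtree is replaced by a leaf.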
Examples of Post-pruning

Model Evaluation
· Purpose:
– To estimate performance of classifier on previously
unseen data (test set)
· Holdout
– Reserve k% for training and (100-k)% for testing
– Random subsampling: repeated holdout
· Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
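
The k-fold partitioning described above can be sketched as an index generator. This is a minimal illustration: the function name `k_fold_indices` is an assumption, and the records are taken in their existing order, so a real run would normally shuffle them first.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k disjoint, near-equal folds."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    # The first n % k folds receive one extra record.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


folds = list(k_fold_indices(n=9, k=3))   # 3-fold cross-validation
loo = list(k_fold_indices(n=4, k=4))     # leave-one-out: k = n

for train, test in folds:
    print("train", train, "test", test)
```

Each record appears in exactly one test fold, so every record is used for testing once and for training k-1 times.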
Cross-validation Example
· 3-fold cross-validation

Variations on Cross-validation
· Repeated cross-validation
– Perform cross-validation a number of times
– Gives an estimate of the variance of the
generalization error
· Stratified cross-validation
– Guarantee the same percentage of class
labels in training and test
– Important when classes are imbalanced and
the sample is small
· Use nested cross-validation approach for model
selection and evaluation
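
One simple way to stratify folds is to deal the records of each class out to the folds in turn. The sketch below is an assumed round-robin scheme, not a method prescribed by the slides, and the name `stratified_folds` is illustrative. Class proportions match exactly only when each class count is a multiple of k.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each record index to one of k folds, round-robin within each class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# An imbalanced sample: 3 positive and 9 negative records.
labels = ["pos"] * 3 + ["neg"] * 9
folds = stratified_folds(labels, k=3)

for fold in folds:
    print([labels[i] for i in fold])
```

With plain 3-fold partitioning of this ordered sample, all three positive records would land in the first fold. Stratifying keeps one in every fold.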
During quarantine I've been watching my boyfriend play more
of his video games. The lobbies he's in tend to be pretty
mature, and even toxic at times, even in games that are usually
targeted at a young audience (Fortnite, for example). Without
adult supervision during their gaming, I've noticed that many
young children tend to say and do things that they and
their parents would consider inappropriate, even if they have
been taught better. I would want to see how many children are
influenced by having older players in their lobby and
how this affects their speech and gaming decisions.

In this setting, I would be an unacknowledged observer because,
rather than playing and seeing how children's behaviors change
toward older community members, I would simply be observing
through my boyfriend's gameplay. According to the text,
observing the children without informing them could raise
ethical concerns, but on the other hand, letting the children
know that their behaviors are being watched could cause
reactivity and alter their natural behavior (Stangor, 2014).
Some of the behaviors and episodes I included in my coding
form are general examples of aggressive behaviors, along with
how the child reacts to them in gameplay and in speech. For
example, if an older member chose to curse out of frustration
and call the game lame because they weren't winning, what
would the child who witnessed that behavior do in turn? Would
their behavior be aggressive, neutral, or virtuous? Would
their gaming tactics become aggressive, neutral, or virtuous?
Aggressive behavior/gaming would count as mimicking the
older peer's aggressive behaviors. So in this example, if the child
chose to curse and run around the game targeting their teammates,
it would be marked as aggressive. Neutral behaviors would be
how the child would react if they were playing on their own. A
neutral reaction here would be if the child said goodbye to their
teammate and continued playing regardless of the thoughts or
behaviors of the older member. This reaction is expected to be
rather rare, given that children tend to look up to and
admire older members of communities they are a part of. A
virtuous reaction would be something said that is
positive, corrects aggressive behaviors, or is
encouraging/team-like. So, if the child, after seeing this
behavior, commented that the other member is really
good at the game and should keep playing to improve,
or offered additional help to the struggling teammate,
this would be considered virtuous behavior.

Reference:
Stangor, C. (2014). Research methods for the behavioral sciences (4th
ed.). Stamford, CT: Cengage Learning.