• Compare and contrast the four artworks provided (Kongo Crucifix, Ethiopian Cross, Bust of Ooni, and Josy Ajiboye's Ooni painting).
• Use artworks A, B and C to discuss the lost-wax process of bronze casting.
• Use the concept of cultural appropriation to discuss how artworks A and B adapted Christianity, and artwork D adapted modernism in African art.
Note:
• Interpretation of these artworks should discuss their origins, historical relevance, functions and symbolism.
• Define your terms properly so that they support your analysis of the images. Appropriation means adaptation, as when artists in one context translate and transform cross-cultural concepts from another context.
• Cite all references, especially online references, using footnotes.
• Avoid merely duplicating information from external sources: plagiarism will be seriously penalized.
(A) Left: Kongo Kingdom of Sao Salvador, Crucifix, brass, 17th Century A.D.
(B) Right: Ethiopia (Aksum), Processional Cross, bronze, 14th-15th Century A.D.
(C) Left: Yoruba Peoples (Ife Kingdom), Bust of Ooni (king), bronze, 10th-14th Centuries A.D.
(D) Right: Josy Ajiboye, Ooni, oil on canvas, 1976
© Tan,Steinbach, Kumar Introduction to Data Mining
4/18/2004 ‹#›
Dr. Oner Celepcikay
ITS 632
Data Mining
Summer 2019, Week 3: Data and Data Exploration
Chapter 2: Data
What is Data?
● Collection of data objects and
their attributes
● An attribute is a property or
characteristic of an object
– Examples: eye color of a
person, temperature, etc.
– Attribute is also known as
variable, field, characteristic,
or feature
● A collection of attributes describes an object
– Object is also known as
record, point, case, sample,
entity, or instance
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Attributes
Objects
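The object/attribute view above can be sketched directly in code, using the first two rows of the table as toy data:

```python
# Record data as a list of objects, each a dict mapping attribute -> value.
# The two records are the first two rows of the table above.
records = [
    {"Tid": 1, "Refund": "Yes", "Marital Status": "Single",
     "Taxable Income": "125K", "Cheat": "No"},
    {"Tid": 2, "Refund": "No", "Marital Status": "Married",
     "Taxable Income": "100K", "Cheat": "No"},
]

# The attribute names (variables/fields/features) are the dict keys.
attributes = set().union(*(r.keys() for r in records))
```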
Attribute Values
● Attribute values are numbers or symbols assigned
to an attribute
● Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute
values
u Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of
values
u Example: Attribute values for ID and age are integers
u But properties of attribute values can be different
– ID has no limit but age has a maximum and minimum value
– Some operations are meaningful on age but meaningless on ID
Types of Attributes
● There are different types of attributes
– Nominal
u Examples: ID numbers, eye color, zip codes
– Ordinal
u Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
– Interval
u Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio
u Examples: temperature in Kelvin, length, time, counts
Attribute Type / Description / Examples / Operations

Nominal: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ² test

Ordinal: The values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests

Interval: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, –)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests

Ratio: For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Operations: geometric mean, harmonic mean, percent variation
Attribute Level / Transformation / Comments

Nominal: Any permutation of values.
  (If all employee ID numbers were reassigned, would it make any difference?)

Ordinal: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function.
  (An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.)

Interval: new_value = a * old_value + b, where a and b are constants.
  (Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).)

Ratio: new_value = a * old_value
  (Length can be measured in meters or feet.)
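The permissible transformations can be sketched as small functions; the Celsius-to-Fahrenheit and meters-to-feet constants are standard, while the function names are my own illustration:

```python
# Sketch of the allowed transformations per attribute level.
def interval_transform(x, a, b):
    # new_value = a * old_value + b  (legal for interval attributes)
    return a * x + b

def ratio_transform(x, a):
    # new_value = a * old_value  (legal for ratio attributes)
    return a * x

boiling_f = interval_transform(100.0, 9 / 5, 32)   # Celsius -> Fahrenheit
two_meters_in_feet = ratio_transform(2.0, 3.28084)  # meters -> feet
```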
Properties of Attribute Values
● The type of an attribute depends on which of the
following properties it possesses:
– Distinctness: =, ≠
– Order: <, >
– Addition: +, –
– Multiplication: *, /
– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & addition
– Ratio attribute: all 4 properties
Discrete and Continuous Attributes
● Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
● Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point variables.
Types of data sets
● Record
– Data Matrix
– Document Data
– Transaction Data
● Graph
– World Wide Web
– Molecular Structures
● Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Important Characteristics of Structured Data
– Dimensionality
u Curse of Dimensionality
– Sparsity
u Only presence counts
– Resolution
u Patterns depend on the scale
– Examples: Texas data, Aleks, Simpson’s Paradox
Record Data
● Data that consists of a collection of records, each
of which consists of a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
● If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute
● Such a data set can be represented by an m by n matrix,
where there are m rows, one for each object, and n
columns, one for each attribute
Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
Document Data
● Each document becomes a `term' vector,
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the
corresponding term occurs in the document.
– In practice only non-zero components are stored
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1    3     0      5     0     2      6     0    2      0        2
Document 2    0     7      0     2     1      0     0    3      0        0
Document 3    0     1      0     0     1      2     2    0      3        0
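The term-vector construction described above can be sketched as follows; the vocabulary matches the columns above, while the example document text is my own toy input:

```python
# Build a term-frequency vector over a fixed vocabulary.
vocab = ["team", "coach", "play", "ball", "score",
         "game", "win", "lost", "timeout", "season"]

def term_vector(text, vocab=vocab):
    # value of each component = number of times the term occurs
    words = text.lower().split()
    return [words.count(term) for term in vocab]

vec = term_vector("team play game game score team")
```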
Transaction Data
● A special type of record data, where
– each record (transaction) involves a set of items.
– For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
● Examples: Generic graph and HTML Links
● Data objects are nodes, links are properties
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel
Solution
of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
Chemical Data
● Benzene Molecule: C6H6
● Nodes are atoms, links are chemical bonds
● Representing molecules as graphs helps to identify substructures.
Ordered Data
● Sequences of transactions
(Figure: a sequence of transactions over time; each element of the sequence is a set of items/events)
Ordered Data
● Genomic sequence data
● Similar to sequential data but no time stamps
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
● Spatio-Temporal Data
Average Monthly
Temperature of
land and ocean
Data Quality
● What kinds of data quality problems?
● How can we detect problems with the data?
● What can we do about these problems?
● Examples of data quality problems:
– Noise and outliers
– missing values
– duplicate data
Data Quality
● Precision: The closeness of repeated measurements (of
the same quantity) to other measurements.
● Bias: A systematic variation of measurements from the
quantity being measured.
● Accuracy: The closeness of measurements to the true value of the quantity being measured.
Noise
● Noise refers to modification of original values
– Examples: distortion of a person's voice when talking on a poor phone and "snow" on a television screen
Two Sine Waves Two Sine Waves + Noise
Outliers
● Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set (distinct from noise)
Missing Values
● Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
● Handling missing values
– Eliminate Data Objects (unless many missing)
– Estimate Missing Values (avg., most common val.)
– Ignore the Missing Value During Analysis
– Replace with all possible values (weighted by their
probabilities)
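The first two handling strategies can be sketched on a toy list (the data is my own example; None marks a missing value):

```python
# Toy data with missing values marked as None.
values = [20.0, None, 30.0, None, 40.0]

# Strategy 1: eliminate objects with missing values
complete = [v for v in values if v is not None]

# Strategy 2: estimate (impute) missing values with the observed average
mean = sum(complete) / len(complete)
imputed = [v if v is not None else mean for v in values]
```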
Duplicate Data
● Data set may include data objects that are
duplicates, or almost duplicates of one another
– Major issue when merging data from heterogeneous sources
– Also attention needed to avoid combining 2 very
similar objects into 1.
● Examples:
– Same person with multiple email addresses
● Data cleaning
– Process of dealing with duplicate data issues
Data Preprocessing
● Aggregation
● Sampling
● Dimensionality Reduction
● Feature subset selection
● Feature creation
● Discretization and Binarization
● Attribute Transformation
Aggregation
● Combining two or more attributes (or objects) into
a single attribute (or object)
● Purpose
– Data reduction
u Reduce the number of attributes or objects
– Change of scale
u Cities aggregated into regions, states, countries, etc
– More "stable" data
u Aggregated data tends to have less variability
Aggregation-Why?
● Less memory & less processing time
– Aggregation allows us to use very expensive algorithms
● High level view of the data
– Store example
● Behavior of groups of objects often more stable
than individual objects.
– A disadvantage of this is losing information or patterns,
– e.g., if you aggregate days into months, you might miss the sales peak on Valentine's Day.
Aggregation
Standard Deviation of Average
Monthly Precipitation
Standard Deviation of Average
Yearly Precipitation
Variation of Precipitation in Australia
Sampling
● Sampling is the main technique employed for data selection.
– It is often used for both the preliminary investigation of the data and the final data analysis.
● Statisticians sample because obtaining the entire set of data
of interest is too expensive or time consuming.
● Sampling is used in data mining because processing the
entire set of data of interest is too expensive or time
consuming.
Sampling …
● The key principle for effective sampling is the
following:
– using a sample will work almost as well as using the entire data set, if the sample is representative
– A sample is representative if it has approximately the same property (of interest) as the original set of data
– If the mean is of interest, then the mean of the sample should be similar to the mean of the full data.
Types of Sampling
● Simple Random Sampling
– There is an equal probability of selecting any particular item
● Sampling without replacement
– As each item is selected, it is removed from the population
● Sampling with replacement
– Objects are not removed from the population as they are
selected for the sample.
u In sampling with replacement, the same object can be picked up more than once (easier to analyze, probability is constant)
● Stratified sampling
– Split the data into several partitions; then draw random samples from each partition (handles representation of less frequent objects)
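The three sampling schemes can be sketched with the standard library; the population size, sample sizes, and the two strata below are my own choices:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
population = list(range(100))

without_replacement = random.sample(population, 10)  # no item repeats
with_replacement = random.choices(population, k=10)  # items may repeat

# Stratified: partition the data, then draw from each partition
strata = [population[:50], population[50:]]
stratified = [random.choice(s) for s in strata]
```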
Sample Size
8000 points 2000 Points 500 Points
Sample Size
● What sample size is necessary to get at least one object from each of 10 groups?
Curse of Dimensionality
● When dimensionality
increases, data becomes
increasingly sparse in the
space that it occupies
● Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
• Randomly generate 500 points
• Compute difference between max and min
distance between any pair of points
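The experiment described in the box can be sketched as follows; the point counts, dimensions compared, and the fixed seed are my own choices. The relative contrast (max − min)/min shrinks as the dimension grows:

```python
import math
import random

def relative_contrast(n_points, dim, seed=0):
    # Randomly generate points in the unit hypercube, then compare the
    # max and min pairwise distances.
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q)
             for i, p in enumerate(pts) for q in pts[i + 1:]]
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(100, 2)     # low-dimensional: large contrast
high = relative_contrast(100, 100)  # high-dimensional: distances concentrate
```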
Dimensionality Reduction
● Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce
noise
● Techniques
– Principal Component Analysis
– Singular Value Decomposition
– Others: supervised and non-linear techniques
Dimensionality Reduction: PCA
● Goal is to find a projection that captures the
largest amount of variation in data
Dimensionality Reduction: PCA
● Find the eigenvectors of the covariance matrix
● The eigenvectors define the new space
● Tends to identify strongest patterns in data.
(Figure: reconstructions using 10, 40, 80, 120, 160, and 206 dimensions)
Dimensionality Reduction: PCA
Face detection and recognition
Detection Recognition “Sally”
Feature Subset Selection
● Another way to reduce dimensionality of data
● Redundant features
– duplicate much or all of the information contained in
one or more other attributes
– Example: purchase price of a product and the amount
of sales tax paid
● Irrelevant features
– contain no information that is useful for the data
mining task at hand
– Example: students' ID is often irrelevant to the task of
predicting students' GPA
Feature Subset Selection
● Techniques:
– Brute-force approach:
u Try all possible feature subsets as input to the data mining algorithm
– Embedded approaches:
u Feature selection occurs naturally as part of the data mining
algorithm
– Filter approaches:
u Features are selected before data mining algorithm is run
– Wrapper approaches:
u Use the data mining algorithm as a black box to find the best subset of attributes
Feature Creation
● Create new attributes that can capture the
important information in a data set much more
efficiently than the original attributes
● Three general methodologies:
– Feature Extraction
u domain-specific
– Mapping Data to New Space
– Feature Construction
u combining features (pixels → edges for face recognition)
u e.g. using density instead of mass and volume in identifying artifacts such as gold, bronze, clay, etc.
Similarity and Dissimilarity
● Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
● Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
● Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
Similarity/Dissimilarity for Simple Attributes
● An example: quality of a product (e.g., candy): {poor, fair, OK, good, wonderful}
● P1 -> wonderful, P2 -> good, P3 -> OK
● P1 is closer to P2 than it is to P3
● Map ordinal attributes onto integers: {poor=0, fair=1, OK=2, good=3, wonderful=4}
● Estimate the distance values for each pair.
● Normalize if you want distances in the [0,1] interval
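The recipe above can be sketched directly; the integer mapping comes from the slide, while normalizing by the range of the scale (4) is my reading of the normalization step:

```python
# Ordinal distance via an integer mapping of the quality scale.
scale = {"poor": 0, "fair": 1, "OK": 2, "good": 3, "wonderful": 4}

def ordinal_distance(a, b):
    span = max(scale.values()) - min(scale.values())
    return abs(scale[a] - scale[b]) / span  # normalized to [0, 1]

d_p1_p2 = ordinal_distance("wonderful", "good")  # P1 vs P2
d_p1_p3 = ordinal_distance("wonderful", "OK")    # P1 vs P3
```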
Euclidean Distance
● Euclidean Distance
dist(p, q) = sqrt( Σ_{k=1..n} (p_k – q_k)² )

where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
● Standardization is necessary, if scales differ.
Euclidean Distance
0
1
2
3
0 1 2 3 4 5 6
p1
p2
p3 p4
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
Distance Matrix
p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
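A sketch that recomputes entries of the distance matrix above from the four points:

```python
import math

# The four points from the table above.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(p, q):
    # sqrt of the sum of squared componentwise differences
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

d12 = euclidean(points["p1"], points["p2"])
d14 = euclidean(points["p1"], points["p4"])
```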
Minkowski Distance
● Minkowski Distance is a generalization of Euclidean
Distance
dist(p, q) = ( Σ_{k=1..n} |p_k – q_k|^r )^(1/r)

where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Minkowski Distance: Examples
● r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors
● r = 2. Euclidean distance
● r → ∞. "supremum" (Lmax norm, L∞ norm) distance.
– This is the maximum difference between any component of the vectors
● Do not confuse r with n, i.e., all these distances are
defined for all numbers of dimensions.
Minkowski Distance
Distance Matrix
point x y
p1 0 2
p2 2 0
p3 3 1
p4 5 1
L1 p1 p2 p3 p4
p1 0 4 4 6
p2 4 0 2 4
p3 4 2 0 2
p4 6 4 2 0
L2 p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0
L∞ p1 p2 p3 p4
p1 0 2 3 5
p2 2 0 1 3
p3 3 1 0 2
p4 5 3 2 0
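A sketch covering r = 1, r = 2, and the supremum case, reproducing entries of the L1, L2, and L∞ matrices above (the function name is mine):

```python
def minkowski(p, q, r):
    if r == float("inf"):  # supremum: max componentwise difference
        return max(abs(pk - qk) for pk, qk in zip(p, q))
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

p1, p2 = (0, 2), (2, 0)
l1 = minkowski(p1, p2, 1)               # city block
l2 = minkowski(p1, p2, 2)               # Euclidean
linf = minkowski(p1, p2, float("inf"))  # supremum
```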
Common Properties of a Distance
● Distances, such as the Euclidean distance,
have some well known properties.
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.
● A distance that satisfies these properties is a
metric
● Examples 2.14 and 2.15
Common Properties of a Similarity
● Similarities also have some well-known properties.
1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data
objects), p and q.
SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) /
(2+1+0+7) = 0.7
J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
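Recomputing the SMC and Jaccard example above:

```python
# The binary vectors from the example.
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

# Count the four match/mismatch cases.
m01 = sum(pi == 0 and qi == 1 for pi, qi in zip(p, q))
m10 = sum(pi == 1 and qi == 0 for pi, qi in zip(p, q))
m00 = sum(pi == 0 and qi == 0 for pi, qi in zip(p, q))
m11 = sum(pi == 1 and qi == 1 for pi, qi in zip(p, q))

smc = (m11 + m00) / (m01 + m10 + m11 + m00)
jaccard = m11 / (m01 + m10 + m11)
```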
Cosine Similarity
● If d1 and d2 are two document vectors, then
cos( d1, d2 ) = (d1 • d2) / ||d1|| ||d2|| ,
where • indicates vector dot product and || d || is the length of
vector d.
● Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2= 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0
+ 0*2 = 5
||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos( d1, d2 ) = 5 / (6.481 * 2.449) = 0.3150
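Recomputing the worked example (note that ||d2|| = sqrt(6) ≈ 2.449):

```python
import math

# The two document vectors from the example.
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(d1, d2))       # dot product
norm1 = math.sqrt(sum(a * a for a in d1))      # ||d1||
norm2 = math.sqrt(sum(b * b for b in d2))      # ||d2||
cosine = dot / (norm1 * norm2)
```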
Correlation
● Correlation measures the linear relationship
between objects
● To compute correlation, we standardize data
objects, p and q, and then take their dot product
p′_k = (p_k – mean(p)) / std(p)
q′_k = (q_k – mean(q)) / std(q)
correlation(p, q) = p′ • q′
Visually Evaluating Correlation
Scatter plots
showing the
similarity from
–1 to 1.
Density
● Density-based clustering requires a notion of density
● Examples:
– Euclidean density
u Euclidean density = number of points per unit volume
– Probability density
– Graph-based density
Euclidean Density – Cell-based
● Simplest approach is to divide region into a
number of rectangular cells of equal volume and
define density as # of points the cell contains
Euclidean Density – Center-based
● Euclidean density is the number of points within a
specified radius of the point
Dr. Oner Celepcikay
CS 4319
Classification
Spring 2019
Machine Learning Methods - Classification
Given a collection of records (training set)
- Each record contains a set of attributes, one of the attributes is
the class.
Find a model for class attribute as a function of the values of
other attributes.
A test set is used to estimate the accuracy of the model.
Goal: previously unseen records (test set) should be assigned a
class as accurately as possible.
Machine Learning – Classification Example
The training set has two categorical attributes (Refund, Marital Status), one continuous attribute (Taxable Income), and the class (Cheat). A classifier is learned from the training set and then applied to the test set.

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
├─ Yes → NO
└─ No → MarSt?
   ├─ Married → NO
   └─ Single, Divorced → TaxInc?
      ├─ < 80K → NO
      └─ > 80K → YES
Machine Learning – Classification Example

MarSt?
├─ Married → NO
└─ Single, Divorced → Refund?
   ├─ Yes → NO
   └─ No → TaxInc?
      ├─ < 80K → NO
      └─ > 80K → YES

There could be more than one tree that fits the same data!
Another Example of Decision Tree

Apply Model to Test Data
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K
Start from the root of the tree and follow the branch that matches each attribute value of the test record:

Refund?
├─ Yes → NO
└─ No → MarSt?
   ├─ Married → NO
   └─ Single, Divorced → TaxInc?
      ├─ < 80K → NO
      └─ > 80K → YES

Refund = No leads to the MarSt test, and Marital Status = Married leads to the leaf NO, so the model assigns Cheat = No to the test record.
Machine Learning – Classification Example
(Figure: a learning algorithm performs induction on the training set, with categorical, categorical, continuous, and class attributes, to build a model; the model is then applied by deduction to assign classes to the test set.)
General Structure of Hunt's Algorithm
Let Dt be the set of training records that reach a node t
General Procedure:
If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
If Dt is an empty set, then t is a leaf node labeled by the default class, yd
If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
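The general procedure can be sketched as a recursion. Real decision-tree induction chooses the split by an impurity measure; this sketch just takes attributes in the given order, which is a placeholder of mine, not Hunt's actual test selection:

```python
def hunt(records, attributes, default="No"):
    # records: list of dicts with a "class" key; attributes: split order
    if not records:                      # empty Dt -> default class yd
        return default
    classes = {r["class"] for r in records}
    if len(classes) == 1:                # pure Dt -> leaf labeled yt
        return classes.pop()
    if not attributes:                   # no test left -> majority class
        return max(classes,
                   key=lambda c: sum(r["class"] == c for r in records))
    attr, rest = attributes[0], attributes[1:]
    children = {v: hunt([r for r in records if r[attr] == v], rest, default)
                for v in {r[attr] for r in records}}
    return (attr, children)              # internal node: (test, branches)

tree = hunt([{"Refund": "Yes", "class": "No"},
             {"Refund": "No", "class": "Yes"}], ["Refund"])
```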
Hunt's Algorithm
Applied step by step to the training data:
1. Start with a single leaf: Don't Cheat.
2. Split on Refund: Yes → Don't Cheat; No → Don't Cheat.
3. Under Refund = No, split on Marital Status: Single, Divorced → Cheat; Married → Don't Cheat.
4. Under Single, Divorced, split on Taxable Income: < 80K → Don't Cheat; >= 80K → Cheat.
Decision Tree Application to Oil & Gas Data
British Petroleum designed a decision tree for gas-oil separation
for offshore oil platforms that replaced an earlier rule-based
expert system.
We will do a similar (but simpler) decision tree example
towards the end of the semester.
Greedy strategy.
Split the records based on an attribute test that optimizes a certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting
Tree Induction
How to determine the Best Split
Before Splitting: 10 records of class 0,
10 records of class 1
Which test condition is the best?
How to determine the Best Split
Greedy approach:
Nodes with homogeneous class distribution are preferred
Need a measure of node impurity:
Non-homogeneous,
High degree of impurity
Homogeneous,
Low degree of impurity
Measures of Node Impurity
Gini Index
Entropy
Misclassification error
How to Find the Best Split
Before splitting, compute the impurity M0 of the parent node. For each candidate split, compute the impurity of the child nodes and combine them:
A? (Yes → Node N1, No → Node N2): impurities M1 and M2 combine into M12
B? (Yes → Node N3, No → Node N4): impurities M3 and M4 combine into M34
Gain = M0 – M12 vs M0 – M34: choose the split with the larger gain.
Measure of Impurity: GINI
Gini Index for a given node t:

GINI(t) = 1 – Σ_j [p(j | t)]²

(NOTE: p(j | t) is the relative frequency of class j at node t.)
Maximum (0.5 for two classes) when records are equally distributed among all classes, implying least interesting information
Minimum (0.0) when all records belong to one class, implying most interesting information
Examples for computing GINI
P(C1) = 0/6 = 0 P(C2) = 6/6 = 1
Gini = 1 – P(C1)2 – P(C2)2 = 1 – 0 – 1 = 0
P(C1) = 1/6 P(C2) = 5/6
Gini = 1 – (1/6)2 – (5/6)2 = 0.278
P(C1) = 2/6 P(C2) = 4/6
Gini = 1 – (2/6)2 – (4/6)2 = 0.444
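Recomputing the three node Gini values above:

```python
def gini(counts):
    # counts: number of records of each class at the node
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)
```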
Examples for computing GINI
A?
Yes
No
Node N1
Node N2
Gini(N1)
= 1 – (4/7)2 – (3/7)2
= 0.4898
Gini(N2)
= 1 – (2/5)2 – (3/5)2
= 0.48
Gini(Children)
= 7/12 * 0.4898 +
5/12 * 0.48
= 0.486
Examples for computing GINI
B?
Yes
No
Node N1
Node N2
Gini(N1)
= 1 – (/)2 – (/)2
=
Gini(N2)
= 1 – (/)2 – (/)2
=
Gini(Children)
=
Splitting Criteria based on Classification Error

Classification error at a node t:

Error(t) = 1 – max_j p(j | t)

Measures misclassification error made by a node.
Maximum (0.5 for two classes) when records are equally distributed among all classes, implying least interesting information
Minimum (0) when all records belong to one class, implying most interesting information
Splitting Criteria based on Classification Error
P(C1) = 0/6 = 0 P(C2) = 6/6 = 1
Error = 1 – max (0, 1) = 1 – 1 = 0
P(C1) = 1/6 P(C2) = 5/6
Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6
P(C1) = 2/6 P(C2) = 4/6
Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3
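Recomputing the three error values above:

```python
def classification_error(counts):
    # counts: number of records of each class at the node
    total = sum(counts)
    return 1 - max(c / total for c in counts)
```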
Greedy strategy.
Split the records based on an attribute test that optimizes a certain criterion.
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting (Next class!)
ANY IDEAS??
Tree Induction
Classification Methods
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
Training set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set:

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
Tid
Refund
Marital
Status
Taxable
Income
Cheat
1
Yes
Single
125K
No
2
NoMarried
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced
95K
Yes
6
No
Married
60K
No
7
Yes
Divorced
220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
10
Refund
Marital
Status
Taxable
Income
Cheat
No
Single
75K
?
YesMarried
50K
?
No
Married
150K
?
Yes
Divorced
90K
?
No
Single
40K
?
No
Married
80K
?
10
Tid
Refund
Marital
Status
Taxable
Income
Cheat
1
Yes
Single
125K
No
2
NoMarried
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced
95K
Yes
6
No
Married
60K
No
7
Yes
Divorced
220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
10
Tid
Refund
Marital
Status
Taxable
Income
Cheat
1
Yes
Single
125K
No
2
NoMarried
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced
95K
Yes
6
No
Married
60K
No
7
Yes
Divorced
220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
10
Test record used in the apply-model-to-test-data walkthrough:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?
[Figure: three candidate test conditions for splitting a node with 10 records of class C0 and 10 of class C1:
● Own Car? (Yes / No) → children with C0:6, C1:4 and C0:4, C1:6
● Car Type? (Family / Sports / Luxury) → children with C0:1, C1:3; C0:8, C1:0; C0:1, C1:7
● Student ID? (c1 … c20) → twenty pure single-record leaves (C0:1, C1:0 or C0:0, C1:1)]
[Figure: a node with class counts C0:5, C1:5 is non-homogeneous (high degree of impurity), while a node with C0:9, C1:1 is homogeneous (low degree of impurity).]
[Figure: impurity is measured before splitting (parent node with class counts N00, N01) and after splitting (child nodes with counts N10/N11 … N40/N41); the split whose weighted child impurity yields the largest gain is chosen.]
GINI(t) = 1 − Σ_j [p(j | t)]²

(where p(j | t) is the relative frequency of class j at node t)
Examples of computing GINI:
● C1 = 0, C2 = 6: Gini = 1 − (0/6)² − (6/6)² = 0.000
● C1 = 1, C2 = 5: Gini = 1 − (1/6)² − (5/6)² = 0.278
● C1 = 2, C2 = 4: Gini = 1 − (2/6)² − (4/6)² = 0.444
● C1 = 3, C2 = 3: Gini = 1 − (3/6)² − (3/6)² = 0.500
Splitting based on GINI (binary split):
● Parent node: C1 = 6, C2 = 6, Gini = 0.500
● Candidate split A: N1 (C1=4, C2=3), N2 (C1=2, C2=3) → weighted Gini = 7/12 × 0.490 + 5/12 × 0.480 = 0.486
● Candidate split B: N1 (C1=1, C2=4), N2 (C1=5, C2=2) → weighted Gini = ?
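The GINI computations above can be checked in a few lines (a minimal sketch; the helper names `gini` and `gini_split` are our own). It reproduces the parent's Gini, split A's weighted Gini, and fills in the "?" for split B.

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2 for a node's class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split: sum over children of (n_i/n) * GINI(child)."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(round(gini([6, 6]), 3))                  # 0.5   (parent)
print(round(gini_split([[4, 3], [2, 3]]), 3))  # 0.486 (candidate split A)
print(round(gini_split([[1, 4], [5, 2]]), 3))  # 0.371 (candidate split B)
```

Split B's weighted Gini (0.371) is lower than split A's (0.486), so split B is the better choice under the Gini criterion.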
Classification error at a node t:

Error(t) = 1 − max_i P(i | t)

Examples of computing error:
● C1 = 0, C2 = 6: Error = 1 − max(0/6, 6/6) = 0
● C1 = 1, C2 = 5: Error = 1 − max(1/6, 5/6) = 1/6
● C1 = 2, C2 = 4: Error = 1 − max(2/6, 4/6) = 1/3
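The three error examples above can be reproduced exactly with fractions (a minimal sketch; the helper name `classification_error` is our own):

```python
from fractions import Fraction

def classification_error(counts):
    """Error(t) = 1 - max_i P(i|t), computed exactly with fractions."""
    return 1 - Fraction(max(counts), sum(counts))

print(classification_error([0, 6]))  # 0
print(classification_error([1, 5]))  # 1/6
print(classification_error([2, 4]))  # 1/3
```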
Dr. Oner Celepcikay
CS 4319
Machine Learning
Week 6
Data Science Tool I – Classification Part II
Tree Induction
● Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
● Issues:
– Determine how to split the records
– How to specify the attribute test condition?
– How to determine the best split?
– Determine when to stop splitting
Stopping Criteria for Tree Induction
● Stop expanding a node when all the records belong to the same class
● Stop expanding a node when all the records have similar attribute values
● Early termination (to be discussed later)
Practical Issues of Classification
● Underfitting and Overfitting
● Missing Values
● Costs of Classification
Underfitting and Overfitting
[Figure: training and test error as a function of model complexity]
● Underfitting: when the model is too simple, both training and test errors are large
Overfitting due to Noise
● The decision boundary is distorted by noise points.
● Bats and whales are misclassified as non-mammals instead of mammals.
● Humans and dolphins are also misclassified as non-mammals because their Body Temperature, Gives_Birth, and Four-legged values are identical to mislabeled records in the training set.
● Spiny anteaters represent an exceptional case: every warm-blooded animal that does not give birth is a non-mammal in the training set.
● The decision tree perfectly fits the training data (training error = 0), but the error rate on the test data is 30%.
Estimating Generalization Errors
● Re-substitution errors: error on the training set, e(T) = Σ e(t)
● Methods for estimating generalization errors:
– Optimistic approach: e'(t) = e(t)
– Pessimistic approach: for each leaf node, e'(t) = e(t) + 0.5, so the total estimate is e'(T) = e(T) + N × 0.5 (N: number of leaf nodes). For a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances): training error = 10/1000 = 1%; generalization error = (10 + 30 × 0.5)/1000 = 2.5%
– Reduced error pruning (REP): uses a validation data set to estimate generalization error
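The pessimistic estimate above is a one-line computation (a minimal sketch; the function name `pessimistic_error` is our own). It reproduces the slide's 30-leaf example:

```python
def pessimistic_error(train_errors, n_leaves, n_instances, penalty=0.5):
    """Pessimistic estimate: e'(T) = (e(T) + N * penalty) / n_instances."""
    return (train_errors + n_leaves * penalty) / n_instances

# The slide's example: 10 training errors, 30 leaves, 1000 instances.
train_err = 10 / 1000                      # 1% training error
gen_err = pessimistic_error(10, 30, 1000)  # (10 + 15)/1000
print(train_err, gen_err)  # 0.01 0.025
```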
Occam’s Razor
● Given two models with similar generalization errors, one should prefer the simpler model over the more complex model.
● For complex models, there is a greater chance that the model was fitted accidentally to errors in the data.
● Therefore, one should include model complexity when evaluating a model.
How to Address Overfitting
● Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
– Stop if all instances belong to the same class
– Stop if all the attribute values are the same
– More restrictive conditions:
– Stop if the number of instances is less than some user-specified threshold
– Stop if the class distribution of the instances is independent of the available features
– Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)
How to Address Overfitting…
● Post-pruning
– Grow the decision tree to its entirety
– Trim the nodes of the decision tree in a bottom-up fashion
– If the generalization error improves after trimming, replace the sub-tree by a leaf node
– The class label of the leaf node is determined from the majority class of instances in the sub-tree
Example of Post-Pruning
● Root node: Class=Yes: 20, Class=No: 10 → training error (before splitting) = 10/30
● Pessimistic error (before splitting) = (10 + 0.5)/30 = 10.5/30
● Four child nodes: (Yes=8, No=4), (Yes=2, No=5), (Yes=6, No=1), (Yes=4, No=0)
● Training error (after splitting) = (4 + 2 + 1 + 0)/30 = 7/30
● Pessimistic error (after splitting) = (7 + 4 × 0.5)/30 = 9/30
● Prune or do not prune? Since 9/30 < 10.5/30, splitting improves the pessimistic error, so do not prune.
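The prune-or-not decision above can be sketched directly (a minimal sketch; the function name `pessimistic` is our own). A subtree is kept only if its pessimistic error beats the single-leaf alternative:

```python
def pessimistic(errors, n_leaves, n, penalty=0.5):
    """Pessimistic error of a (sub)tree: (errors + n_leaves * penalty) / n."""
    return (errors + n_leaves * penalty) / n

# Before splitting: one leaf, 10/30 training errors.
# After splitting: four leaves, 7/30 training errors.
before = pessimistic(10, 1, 30)  # 10.5/30 = 0.35
after  = pessimistic(7, 4, 30)   # 9/30   = 0.30
print("prune" if after >= before else "do not prune")  # do not prune
```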
Handling Missing Attribute Values
● Missing values affect decision tree construction in three different ways:
– How impurity measures are computed
– How to distribute an instance with a missing value to child nodes
– How a test instance with a missing value is classified
Model Evaluation
● Metrics for Performance Evaluation: how to evaluate the performance of a model?
● Methods for Performance Evaluation: how to obtain reliable estimates?
● Methods for Model Comparison: how to compare the relative performance among competing models?
Metrics for Performance Evaluation
● Focus on the predictive capability of a model, rather than how fast it classifies or builds models, scalability, etc.
● Confusion matrix:

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL   Class=Yes     a (TP)      b (FN)
CLASS    Class=No      c (FP)      d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation…
● Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Limitation of Accuracy
● Consider a 2-class problem: number of Class 0 examples = 9990, number of Class 1 examples = 10
● If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
● Accuracy is misleading because the model does not detect any class 1 example
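The accuracy trap above is easy to demonstrate (a minimal sketch; the function name `accuracy` is our own). Treating class 1 as the positive class, a model that always predicts class 0 produces TP = 0 and FN = 10, yet scores 99.9%:

```python
def accuracy(tp, fn, fp, tn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# 9990 class-0 examples, 10 class-1 examples; predict everything as class 0:
# TP = 0, FN = 10, FP = 0, TN = 9990.
print(accuracy(tp=0, fn=10, fp=0, tn=9990))  # 0.999
```

High accuracy here says nothing about the model's ability to find the rare class, which is why cost-sensitive measures follow.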
Cost Matrix
● C(i|j): cost of misclassifying a class j example as class i

                       PREDICTED CLASS
C(i|j)                 Class=Yes    Class=No
ACTUAL   Class=Yes     C(Yes|Yes)   C(No|Yes)
CLASS    Class=No      C(Yes|No)    C(No|No)
Computing Cost of Classification

Cost matrix C(i|j):        PREDICTED CLASS
                           +       −
ACTUAL     +               −1      100
CLASS      −               1       0

Model M1:                  PREDICTED CLASS
                           +       −
ACTUAL     +               150     40
CLASS      −               60      250
Accuracy = 80%, Cost = 3910

Model M2:                  PREDICTED CLASS
                           +       −
ACTUAL     +               250     45
CLASS      −               5       200
Accuracy = 90%, Cost = 4255
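The M1/M2 comparison can be verified by summing count × cost over the matching cells (a minimal sketch; the function name `total_cost` and the nested-list layout are our own). It reproduces the costs 3910 and 4255, showing that the more accurate model M2 actually incurs the higher cost:

```python
def total_cost(confusion, cost_matrix):
    """Sum of count * cost over corresponding cells.
    Row/column order: [[(+,+), (+,-)], [(-,+), (-,-)]] as actual x predicted."""
    return sum(n * c
               for row_n, row_c in zip(confusion, cost_matrix)
               for n, c in zip(row_n, row_c))

cost = [[-1, 100], [1, 0]]     # C(i|j) from the slide
m1 = [[150, 40], [60, 250]]    # Model M1 confusion counts
m2 = [[250, 45], [5, 200]]     # Model M2 confusion counts
print(total_cost(m1, cost))    # 3910
print(total_cost(m2, cost))    # 4255
```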
Cost vs Accuracy
● Count matrix: cells a, b, c, d as in the confusion matrix above
● Cost matrix: cost p for the correctly classified cells (a, d) and cost q for the misclassified cells (b, c)
● Accuracy is proportional to cost if C(Yes|Yes) = C(No|No) = p and C(Yes|No) = C(No|Yes) = q
Model Evaluation
● Metrics for Performance Evaluation: how to evaluate the performance of a model?
● Methods for Performance Evaluation: how to obtain reliable estimates?
● Methods for Model Comparison: how to compare the relative performance among competing models?
Methods for Performance Evaluation
● How to obtain a reliable estimate of performance?
● The performance of a model may depend on other factors besides the learning algorithm:
– Class distribution
– Cost of misclassification
– Size of training and test sets
Methods of Estimation
● Holdout: reserve 2/3 for training and 1/3 for testing
● Random subsampling: repeated holdout
● Cross validation: partition data into k disjoint subsets
– k-fold: train on k−1 partitions, test on the remaining one
– Leave-one-out: k = n
● Stratified sampling: oversampling vs. undersampling
● Bootstrap: sampling with replacement
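The k-fold partition step can be sketched as follows (a minimal illustration; the function name `k_fold_indices` and the shuffling scheme are our own). Each record lands in exactly one fold; each fold is held out once while the other k−1 folds form the training set:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1, then deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(10, 5)
for held_out in folds:                 # each fold serves as the test set once
    train = [i for f in folds if f is not held_out for i in f]
    assert len(train) == 8 and len(held_out) == 2
```

Setting k = n gives the leave-one-out scheme mentioned above.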
Dr. Oner Celepcikay
ITS 632
Data Mining
Algorithms: Clustering Part I
Clustering Analysis (ITS 632)
● Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
● Inter-cluster distances are maximized; intra-cluster distances are minimized
Clustering Analysis
● Supervised Learning vs. Unsupervised Learning (clustering is unsupervised: no class labels)
Notion of a Cluster can be Ambiguous
● How many clusters?
[Figure: the same set of points interpreted as two clusters, four clusters, or six clusters]
Types of Clustering
● Partitional Clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
● Hierarchical Clustering: a set of nested clusters organized as a hierarchical tree
Partitional Clustering
[Figure: original points and one partitional clustering of them]
Hierarchical Clustering
[Figure: nested clusters and the corresponding hierarchical tree]
Types of Clustering: Objective Function
● Clusters defined by an objective function: find clusters that minimize or maximize an objective function
● Enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters by using the given objective function
● Parameters for the model are determined from the data
Types of Clustering: Objective Function
● Map the clustering problem to a different domain
● The proximity matrix defines a weighted graph, where the nodes are the points being clustered, and the weighted edges represent the proximities between points
● Clustering is equivalent to breaking the graph into connected components, one for each cluster
● Want to minimize the edge weight between clusters and maximize the edge weight within clusters
Clustering Algorithms
● K-means and its variants
● Hierarchical clustering
● Density-based clustering
K-means Clustering
● Partitional clustering approach
● Each cluster is associated with a centroid (center point)
● Each point is assigned to the cluster with the closest centroid
● Number of clusters, K, must be specified
● The basic algorithm is very simple:
1. Select K points as the initial centroids.
2. Repeat: form K clusters by assigning all points to the closest centroid, then recompute the centroid of each cluster; stop when the centroids no longer change.
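The basic algorithm above can be sketched in a few lines (a minimal sketch; to keep it deterministic it takes the first K points as initial centroids, whereas the slides note that initial centroids are usually chosen randomly, and the function name `kmeans` is our own):

```python
def kmeans(points, k, iters=100):
    """Basic K-means: assign points to the closest centroid, recompute
    centroids as cluster means, stop when centroids no longer change."""
    centroids = [tuple(map(float, p)) for p in points[:k]]
    for _ in range(iters):
        # 1) Form K clusters by assigning each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # 2) Recompute the centroid (mean) of each non-empty cluster.
        new = [tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centroids[i]
               for i, pts in enumerate(clusters)]
        if new == centroids:   # 3) Centroids unchanged -> converged.
            break
        centroids = new
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

On these two well-separated groups the algorithm converges in a few iterations to centroids near (1/3, 1/3) and (31/3, 31/3), illustrating the slide's point that most of the convergence happens early.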
K-means Clustering
● Initial centroids are often chosen randomly; the clusters produced vary from one run to another
● The centroid is the mean of the points in the cluster
● 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
● K-means will converge for the common similarity measures mentioned above
● Most of the convergence happens in the first few iterations
K-means Clustering in Action
[Figure: successive K-means iterations, showing point assignments and centroid updates]
K-Means Animation: http://tech.nitoyon.com/en/blog/2013/11/07/k-means/
Importance of Choosing Initial Centroids
[Figure: two runs with different initial centroids converging to different clusterings]
Solutions to the Initial Centroids Problem
● Multiple runs: helps, but probability is not on your side
● Sample and use hierarchical clustering to find K centroids
● Select more than K initial centroids and then select among these initial centroids (e.g., the most widely separated)
● Postprocessing
 
Review the ESL virtual classroom by clicking on the resource link in.docx
Review the ESL virtual classroom by clicking on the resource link in.docxReview the ESL virtual classroom by clicking on the resource link in.docx
Review the ESL virtual classroom by clicking on the resource link in.docx
 

  • 1. • Compare and contrast the four artworks provided (Kongo Crucifix, Ethiopian Cross, Bust of Ooni, and Josy Ajiboye's Ooni painting).
  • Use artworks A, B and C to discuss the lost-wax process of bronze casting.
  • Use the concept of cultural appropriation to discuss how artworks A and B adapted Christianity, and artwork D adapted modernism in African art.
  Note:
  • Interpretation of these artworks should discuss their origins, historical relevance, functions and symbolism.
  • Define your terms properly so that they support your analysis of the images. Appropriation means adaptation, as in when artists in one context translate and transform cross-cultural concepts from another context.
  • Cite all references, especially online references, using footnotes.
  • Avoid merely duplicating information from external sources: plagiarism will be seriously penalized.
  (A) Left: Kongo Kingdom of Sao Salvador, Crucifix, brass, 17th Century A.D.
  • 2. (B) Right: Ethiopia (Aksum), Processional Cross, bronze, 14th–15th Century A.D.
  (C) Left: Yoruba Peoples (Ife Kingdom), Bust of Ooni (king), bronze, 10th–14th Centuries A.D.
  (D) Right: Josy Ajiboye, Ooni, oil on canvas, 1976

  © Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
  Dr. Oner Celepcikay, ITS 632 Data Mining, Summer 2019. Week 3: Data and Data Exploration

  Chapter 2: Data
  • 3. What is Data?
  ● A collection of data objects and their attributes
  ● An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – An attribute is also known as a variable, field, characteristic, or feature
  ● A collection of attributes describes an object
  – An object is also known as a record, point, case, sample, entity, or instance

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes
  (columns = attributes, rows = objects)

  • 4. Attribute Values
  ● Attribute values are numbers or symbols assigned to an attribute
  ● Distinction between attributes and attribute values
  – The same attribute can be mapped to different attribute values
  ◦ Example: height can be measured in feet or meters
  • 5. – Different attributes can be mapped to the same set of values
  ◦ Example: attribute values for ID and age are integers
  ◦ But the properties of the attribute values can differ
  – ID has no limit, but age has a maximum and minimum value
  – Some operations are meaningful on age but meaningless on ID

  Types of Attributes
  ● There are different types of attributes
  – Nominal
  ◦ Examples: ID numbers, eye color, zip codes
  – Ordinal
  ◦ Examples: rankings (e.g., taste of potato chips on a scale from 1–10), grades, height in {tall, medium, short}
  – Interval
  ◦ Examples: calendar dates, temperatures in Celsius or Fahrenheit
  – Ratio
  ◦ Examples: temperature in Kelvin, length, time, counts

  • 6. Attribute type summary (description / examples / operations):
  – Nominal: values are just different names; they provide only enough information to distinguish one object from another (=, ≠). Examples: zip codes, employee ID numbers, eye color, sex {male, female}. Operations: mode, entropy, contingency correlation, χ² test.
  – Ordinal: values provide enough information to order objects (<, >). Examples: hardness of minerals, {good, better, best}, grades, street numbers. Operations: median, percentiles, rank correlation, run tests, sign tests.
  – Interval: differences between values are meaningful, i.e., a unit of measurement exists (+, −). Examples: calendar dates, temperature in Celsius or Fahrenheit. Operations: mean, standard deviation, Pearson's correlation, t and F tests.
  – Ratio: both differences and ratios are meaningful (*, /). Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Operations: geometric mean, harmonic mean, percent variation.

  • 7. Permissible transformations (attribute level / transformation / comments):
  – Nominal: any permutation of values. If all employee ID numbers were reassigned, would it make any difference?
  – Ordinal: an order-preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function. An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
  – Interval: new_value = a * old_value + b, where a and b are constants. Thus the Fahrenheit and Celsius temperature scales differ in where their zero value is and in the size of a unit (degree).
  – Ratio: new_value = a * old_value. Length can be measured in meters or feet.

  • 8. Properties of Attribute Values
  ● The type of an attribute depends on which of the following properties it possesses:
  • 9. – Distinctness: =, ≠
  – Order: <, >
  – Addition: +, −
  – Multiplication: *, /
  ● Attribute types by the properties they possess:
  – Nominal attribute: distinctness
  – Ordinal attribute: distinctness & order
  – Interval attribute: distinctness, order & addition
  – Ratio attribute: all 4 properties

  Discrete and Continuous Attributes
  ● Discrete Attribute
  – Has only a finite or countably infinite set of values
  – Examples: zip codes, counts, or the set of words in a collection of documents
  – Often represented as integer variables
  – Note: binary attributes are a special case of discrete attributes
  ● Continuous Attribute
  – Has real numbers as attribute values
  – Examples: temperature, height, or weight
  – Practically, real values can only be measured and represented using a finite number of digits
  – Continuous attributes are typically represented as floating-point variables

  • 10. Types of data sets
  ● Record
  – Data Matrix
  – Document Data
  – Transaction Data
  ● Graph
  – World Wide Web
  – Molecular Structures
  ● Ordered
  – Spatial Data
  – Temporal Data
  – Sequential Data
  – Genetic Sequence Data

  Important Characteristics of Structured Data
  • 11. – Dimensionality
  ◦ Curse of Dimensionality
  – Sparsity
  ◦ Only presence counts
  – Resolution
  ◦ Patterns depend on the scale
  – Examples: Texas data, Aleks, Simpson's Paradox

  Record Data
  ● Data that consists of a collection of records, each of which consists of a fixed set of attributes

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  • 12. 6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

  Data Matrix
  ● If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
  ● Such a data set can be represented by an m-by-n matrix, where there are m rows, one for each object, and n columns, one for each attribute

  • 13. Thickness  Load  Distance  Projection of y load  Projection of x load
        1.1        2.2   16.22     6.25                  12.65
        1.2        2.7   15.22     5.27                  10.23

  Document Data
  ● Each document becomes a 'term' vector
  – each term is a component (attribute) of the vector
  – the value of each component is the number of times the corresponding term occurs in the document
  – In practice, only the non-zero entries are stored
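The two-row table above is exactly an m-by-n data matrix. A minimal NumPy sketch (the values are the slide's; the variable names are illustrative):

```python
import numpy as np

# Columns, in the slide's order: Thickness, Load, Distance,
# Projection of y load, Projection of x load.
columns = ["Thickness", "Load", "Distance",
           "Projection of y load", "Projection of x load"]

# m = 2 data objects (rows), n = 5 attributes (columns).
data = np.array([
    [1.1, 2.2, 16.22, 6.25, 12.65],
    [1.2, 2.7, 15.22, 5.27, 10.23],
])

print(data.shape)                       # an m-by-n matrix: (2, 5)
print(data[:, columns.index("Load")])   # one attribute across all objects
```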
  • 14. tim eout lost w in gam e score ball play coach team Document 2 Document 3 3 0 5 0 2 6 0 2 0 2 0 0 7 0 2 1 0 0 3 0 0 1 0 0 1 2 2 0 3 0
  • 15. Transaction Data
  ● A special type of record data, where
  – each record (transaction) involves a set of items
  – For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk

  Graph Data
  • 16. ● Examples: generic graph and HTML links
  ● Data objects are nodes, links are properties

  <a href="papers/papers.html#bbbb">Data Mining</a>
  <li><a href="papers/papers.html#aaaa">Graph Partitioning</a>
  <li><a href="papers/papers.html#aaaa">Parallel Solution of Sparse Linear System of Equations</a>
  <li><a href="papers/papers.html#ffff">N-Body Computation and Dense Linear System Solvers
  • 17. Chemical Data
  ● Benzene molecule: C6H6
  ● Nodes are atoms, links are chemical bonds
  ● Helps to identify substructures

  Ordered Data
  ● Sequences of transactions (each element of the sequence is a set of items/events)
  • 18. Ordered Data
  ● Genomic sequence data
  ● Similar to sequential data, but with no time stamps

  GGTTCCGCCTTCAGCCCCGCGCC
  CGCAGGGCCCGCCCCGCGCCGTC
  GAGAAGGGCCCGCCTGGCGGGCG
  GGGGGAGGCGGGGCCGCCCGAGC
  CCAACCGAGTCCGACCAGGTGCC
  CCCTCTGCTCGGCCTAGACCTGA
  GCTCATTAGGCGGCAGCGGACAG
  GCCAAGTAGAACACGCGAAGCGC
  TGGGCTGCCTGCTGCGACCAGGG
  • 19. Ordered Data
  ● Spatio-Temporal Data (e.g., average monthly temperature of land and ocean)

  Data Quality
  ● What kinds of data quality problems are there?
  ● How can we detect problems with the data?
  ● What can we do about these problems?
  ● Examples of data quality problems:
  – noise and outliers
  – missing values
  – duplicate data

  • 20. Data Quality
  ● Precision: the closeness of repeated measurements (of the same quantity) to one another.
  ● Bias: a systematic variation of measurements from the quantity being measured.
  ● Accuracy: the closeness of measurements to the true value of the quantity being measured.
  • 21. Noise
  ● Noise refers to the modification of original values
  – Examples: distortion of a person's voice when talking on a poor phone connection, and "snow" on a television screen
  (Figure: two sine waves, and the same two sine waves with noise added)

  Outliers
  ● Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set (different from noise)
  • 22. Missing Values
  ● Reasons for missing values
  – Information is not collected (e.g., people decline to give their age and weight)
  – Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
  ● Handling missing values
  – Eliminate data objects (unless many values are missing)
  – Estimate missing values (e.g., average or most common value)
  – Ignore the missing value during analysis
  – Replace with all possible values (weighted by their probabilities)
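Two of the handling options above (eliminate objects, estimate with the mean) can be sketched in plain Python on an invented four-record table:

```python
from statistics import mean

# Invented records; None marks a missing value.
records = [
    {"age": 25,   "income": 50_000},
    {"age": None, "income": 60_000},
    {"age": 40,   "income": None},
    {"age": 35,   "income": 55_000},
]

# Option 1: eliminate data objects that have any missing value.
complete = [r for r in records if None not in r.values()]

# Option 2: estimate a missing value with the attribute's mean.
def fill_with_mean(records, attr):
    observed = [r[attr] for r in records if r[attr] is not None]
    avg = mean(observed)
    return [dict(r, **{attr: r[attr] if r[attr] is not None else avg})
            for r in records]

filled = fill_with_mean(records, "age")
print(len(complete))      # 2 complete records survive
print(filled[1]["age"])   # the mean of 25, 40, 35
```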
  • 23. Duplicate Data
  ● A data set may include data objects that are duplicates, or almost duplicates, of one another
  – A major issue when merging data from heterogeneous sources
  – Care is also needed to avoid combining two very similar objects into one
  ● Example:
  – the same person with multiple email addresses
  ● Data cleaning
  – the process of dealing with duplicate-data issues
  • 24. Data Preprocessing
  ● Aggregation
  ● Sampling
  ● Dimensionality Reduction
  ● Feature subset selection
  ● Feature creation
  ● Discretization and Binarization
  ● Attribute Transformation

  Aggregation
  ● Combining two or more attributes (or objects) into a single attribute (or object)
  • 25. ● Purpose
  – Data reduction
  ◦ Reduce the number of attributes or objects
  – Change of scale
  ◦ Cities aggregated into regions, states, countries, etc.
  – More "stable" data
  ◦ Aggregated data tends to have less variability

  Aggregation: Why?
  ● Less memory and less processing time
  – Aggregation allows the use of very expensive algorithms
  ● A high-level view of the data
  – Store example
  • 26. ● The behavior of groups of objects is often more stable than that of individual objects
  – A disadvantage is the loss of information or patterns
  – e.g., if you aggregate days into months, you might miss the sales peak on Valentine's Day

  (Figure: variation of precipitation in Australia; standard deviation of average monthly vs. average yearly precipitation)
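The Valentine's Day caveat is easy to see in code: aggregating daily sales into months changes the scale and stabilizes the numbers, but the single-day peak disappears (all figures below are made up):

```python
from collections import defaultdict

# (month, day, sales): invented daily sales with a Feb 14 peak.
daily_sales = [
    ("Feb", 13, 120), ("Feb", 14, 900), ("Feb", 15, 110),
    ("Mar", 14, 130), ("Mar", 15, 125),
]

# Aggregate days into months: fewer objects, more stable values ...
monthly = defaultdict(int)
for month, _day, sales in daily_sales:
    monthly[month] += sales

print(dict(monthly))   # ... but the Feb 14 peak is no longer visible
```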
  • 27. Sampling
  ● Sampling is the main technique employed for data selection.
  – It is often used for both the preliminary investigation of the data and the final data analysis.
  ● Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
  ● Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.

  • 28. Sampling …
  ● The key principle for effective sampling is the following:
  – Using a sample will work almost as well as using the entire data set, if the sample is representative
  – A sample is representative if it has approximately the same property (of interest) as the original set of data
  – If the mean is of interest, then the mean of the sample should be similar to the mean of the full data

  Types of Sampling
  ● Simple Random Sampling
  • 29. – There is an equal probability of selecting any particular item
  ● Sampling without replacement
  – As each item is selected, it is removed from the population
  ● Sampling with replacement
  – Objects are not removed from the population as they are selected for the sample
  ◦ The same object can be picked more than once (easier to analyze, since the selection probability stays constant)
  ● Stratified sampling
  – Split the data into several partitions, then draw random samples from each partition (handles representation of less frequent objects)

  Sample Size
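The three sampling schemes just listed can be sketched with the standard library's random module (the population and strata are invented):

```python
import random

random.seed(0)
population = list(range(100))

# Simple random sampling without replacement: each item appears at most once.
without = random.sample(population, 10)

# Sampling with replacement: the same item can be drawn more than once.
with_repl = [random.choice(population) for _ in range(10)]

# Stratified sampling: partition first, then sample each stratum,
# so the infrequent "small" group is still represented.
strata = {"small": list(range(90, 100)), "large": list(range(90))}
stratified = {name: random.sample(items, 2) for name, items in strata.items()}

print(len(set(without)), len(stratified["small"]))
```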
  • 30. (Figure: the same data set sampled at 8000, 2000, and 500 points)
  ● What sample size is necessary to get at least one object from each of 10 groups?

  Curse of Dimensionality
  ● When dimensionality increases, data becomes increasingly sparse in the space that it occupies
  • 31. ● Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful
  ● Experiment:
  – Randomly generate 500 points
  – Compute the difference between the maximum and minimum distance between any pair of points

  Dimensionality Reduction
  ● Purpose:
  – Avoid the curse of dimensionality
  – Reduce the amount of time and memory required by data mining algorithms
  – Allow data to be more easily visualized
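The 500-point experiment described above can be reproduced directly: as the dimensionality grows, the gap between the largest and smallest pairwise distance shrinks relative to the smallest. A sketch (the pair subsampling is only there to keep the run fast):

```python
import math
import random

def spread(dim, n_points=500, seed=1):
    """Return (max - min) / min over pairwise distances of random points."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = []
    # Euclidean distances over a subsample of pairs to keep it quick.
    for i in range(0, n_points, 10):
        for j in range(i + 1, n_points, 10):
            dists.append(math.dist(pts[i], pts[j]))
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100):
    print(dim, round(spread(dim), 2))   # the ratio drops as dim grows
```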
  • 32. – May help to eliminate irrelevant features or reduce noise
  ● Techniques
  – Principal Component Analysis
  – Singular Value Decomposition
  – Others: supervised and non-linear techniques

  Dimensionality Reduction: PCA
  ● The goal is to find a projection that captures the largest amount of variation in the data
  (Figure: 2-D data with principal eigenvector e)
  • 33. Dimensionality Reduction: PCA
  ● Find the eigenvectors of the covariance matrix
  ● The eigenvectors define the new space
  ● Tends to identify the strongest patterns in the data
  (Figure: face images reconstructed using 10, 40, 80 and more dimensions)
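The covariance-eigenvector recipe above, sketched with NumPy on invented 2-D data (not the deck's example):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2-D data stretched along one direction.
x = rng.normal(size=200)
data = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=200)])

# Center, then take the eigenvectors of the covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order

# Project onto the top eigenvector: the direction of largest variation.
top = eigvecs[:, -1]
projected = centered @ top

print(eigvals)             # one eigenvalue dominates
print(projected.shape)     # (200,): reduced to 1 dimension
```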
  • 34. (Figure: PCA reconstructions continuing at 120, 160, and 206 dimensions)

  Face detection and recognition
  (Figure: detection, then recognition of "Sally")

  Feature Subset Selection
  ● Another way to reduce the dimensionality of data
  • 35. ● Redundant features
  – duplicate much or all of the information contained in one or more other attributes
  – Example: the purchase price of a product and the amount of sales tax paid
  ● Irrelevant features
  – contain no information that is useful for the data mining task at hand
  – Example: students' IDs are often irrelevant to the task of predicting students' GPAs

  Feature Subset Selection
  ● Techniques:
  – Brute-force approach:
  ◦ Try all possible feature subsets as input to the data mining algorithm
  • 36. – Embedded approaches:
  ◦ Feature selection occurs naturally as part of the data mining algorithm
  – Filter approaches:
  ◦ Features are selected before the data mining algorithm is run
  – Wrapper approaches:
  ◦ Use the data mining algorithm as a black box to find the best subset of attributes

  Feature Subset Selection (figure-only slide)
  • 37. Feature Creation
  ● Create new attributes that can capture the important information in a data set much more efficiently than the original attributes
  ● Three general methodologies:
  – Feature Extraction
  ◦ domain-specific
  – Mapping Data to a New Space
  – Feature Construction
  ◦ combining features (pixels → edges for face recognition)
  ◦ e.g., using density instead of mass and volume in identifying artifacts such as gold, bronze, clay, etc.

  Similarity and Dissimilarity
  • 38. ● Similarity
  – A numerical measure of how alike two data objects are
  – Higher when objects are more alike
  – Often falls in the range [0, 1]
  ● Dissimilarity
  – A numerical measure of how different two data objects are
  – Lower when objects are more alike
  – The minimum dissimilarity is often 0
  – The upper limit varies
  ● Proximity refers to either a similarity or a dissimilarity

  Similarity/Dissimilarity for Simple Attributes
  p and q are the attribute values for two data objects.
  • 39. Similarity/Dissimilarity for Simple Attributes
  ● An example: quality of a product (e.g., candy): {poor, fair, OK, good, wonderful}
  ● P1 → wonderful, P2 → good, P3 → OK
  ● P1 is closer to P2 than it is to P3
  ● Map ordinal attributes onto integers: {poor = 0, fair = 1, OK = 2, good = 3, wonderful = 4}
  ● Estimate the distance values for each pair
  ● Normalize if you want values in the [0, 1] interval

  Euclidean Distance
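The candy-quality mapping just described can be checked numerically; dividing by the number of levels minus one normalizes the distances into [0, 1] (a small sketch):

```python
levels = ["poor", "fair", "OK", "good", "wonderful"]
code = {name: i for i, name in enumerate(levels)}   # poor=0 ... wonderful=4

def ordinal_distance(a, b):
    # Normalize by (number of levels - 1) so the result lies in [0, 1].
    return abs(code[a] - code[b]) / (len(levels) - 1)

p1, p2, p3 = "wonderful", "good", "OK"
print(ordinal_distance(p1, p2))   # 0.25
print(ordinal_distance(p1, p3))   # 0.5  -> P1 is closer to P2 than to P3
```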
  • 40. dist(p, q) = ( Σ_{k=1}^{n} (p_k − q_k)² )^(1/2)

  where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
  ● Standardization is necessary if scales differ.

  • 41. Euclidean Distance
  (Figure: the four points plotted in the plane)
  point  x  y
  p1     0  2
  p2     2  0
  p3     3  1
  p4     5  1

  • 42. Distance Matrix
         p1     p2     p3     p4
  p1     0      2.828  3.162  5.099
  p2     2.828  0      1.414  3.162
  p3     3.162  1.414  0      2
  p4     5.099  3.162  2      0

  Minkowski Distance
  ● Minkowski Distance is a generalization of the Euclidean Distance:

  • 43. dist(p, q) = ( Σ_{k=1}^{n} |p_k − q_k|^r )^(1/r)

  where r is a parameter, n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

  Minkowski Distance: Examples
  ● r = 1. City block (Manhattan, taxicab, L1 norm) distance.
  – A common example of this is the Hamming distance, which is just the number of bits that differ between two binary vectors
  • 44. ● r = 2. Euclidean distance
  ● r → ∞. "Supremum" (Lmax norm, L∞ norm) distance.
  – This is the maximum difference between any component of the vectors
  ● Do not confuse r with n: all these distances are defined for any number of dimensions.

  Minkowski Distance
  point  x  y
  p1     0  2
  p2     2  0
  p3     3  1
  p4     5  1
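The city-block, Euclidean, and supremum distances for these four points can be computed directly from the Minkowski formula (a minimal sketch; math.inf selects the supremum case):

```python
import math

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def minkowski(p, q, r):
    if math.isinf(r):   # supremum (L-max) distance
        return max(abs(a - b) for a, b in zip(p, q))
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

# L1, L2 and L-infinity distances between p1 and p2.
p, q = points["p1"], points["p2"]
print(minkowski(p, q, 1))            # 4
print(round(minkowski(p, q, 2), 3))  # 2.828
print(minkowski(p, q, math.inf))     # 2
```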
  • 45. L1     p1  p2  p3  p4
  p1     0   4   4   6
  p2     4   0   2   4
  p3     4   2   0   2
  p4     6   4   2   0

  L2     p1     p2     p3     p4
  p1     0      2.828  3.162  5.099
  p2     2.828  0      1.414  3.162
  p3     3.162  1.414  0      2
  p4     5.099  3.162  2      0

  L∞     p1  p2  p3  p4
  p1     0   2   3   5
  p2     2   0   1   3
  p3     3   1   0   2
  p4     5   3   2   0

  Common Properties of a Distance
● Distances, such as the Euclidean distance, have some well-known properties:
  1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
  2. d(p, q) = d(q, p) for all p and q. (Symmetry)
  3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)
  where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.
● A distance that satisfies these properties is a metric
● Examples 2.14 and 2.15
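The distance matrices above can be reproduced with a short sketch (plain Python; the point coordinates are the ones listed on the earlier slide, and the function names are illustrative):

```python
def minkowski(p, q, r):
    # L_r distance: (sum_k |p_k - q_k|^r)^(1/r)
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

def supremum(p, q):
    # L_inf distance: the maximum difference over any component
    return max(abs(a - b) for a, b in zip(p, q))

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
```

For example, minkowski(points["p1"], points["p2"], 1) gives the L1 entry 4, r = 2 gives the L2 entry 2.828, and supremum(points["p1"], points["p4"]) gives the L∞ entry 5.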
Common Properties of a Similarity
● Similarities also have some well-known properties:
  1. s(p, q) = 1 (or maximum similarity) only if p = q.
  2. s(p, q) = s(q, p) for all p and q. (Symmetry)
  where s(p, q) is the similarity between points (data objects) p and q.

SMC versus Jaccard: Example
  p = 1 0 0 0 0 0 0 0 0 0
  q = 0 0 0 0 0 0 1 0 0 1
  M01 = 2 (the number of attributes where p was 0 and q was 1)
  M10 = 1 (the number of attributes where p was 1 and q was 0)
  M00 = 7 (the number of attributes where p was 0 and q was 0)
  M11 = 0 (the number of attributes where p was 1 and q was 1)

  SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
  J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0

Cosine Similarity
● If d1 and d2 are two document vectors, then
  cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
  where • indicates the vector dot product and ||d|| is the length of vector d.
● Example:
  d1 = 3 2 0 5 0 0 0 2 0 0
  d2 = 1 0 0 0 0 0 0 1 0 2
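The SMC and Jaccard values computed above can be verified with a short sketch over the two binary vectors:

```python
def smc_and_jaccard(p, q):
    # count the four match types between two binary vectors
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)   # counts 0-0 matches
    jaccard = m11 / (m01 + m10 + m11)             # ignores 0-0 matches
    return smc, jaccard

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
```

smc_and_jaccard(p, q) returns 0.7 and 0, matching the example.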
  d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
  ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
  cos(d1, d2) = 5 / (6.481 × 2.449) = 0.315

Correlation
● Correlation measures the linear relationship between objects
● To compute correlation, we standardize data objects p and q, and then take their dot product:
  p′_k = (p_k − mean(p)) / std(p)
  q′_k = (q_k − mean(q)) / std(q)
  correlation(p, q) = p′ • q′

Visually Evaluating Correlation
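The cosine value above, and the correlation recipe (standardize, then take the dot product), can be sketched as follows; note that averaging the dot product of the standardized vectors over the n attributes yields the usual Pearson correlation:

```python
import math

def cosine(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    len1 = math.sqrt(sum(a * a for a in d1))   # ||d1||
    len2 = math.sqrt(sum(b * b for b in d2))   # ||d2||
    return dot / (len1 * len2)

def correlation(p, q):
    def standardize(v):
        m = sum(v) / len(v)
        s = math.sqrt(sum((x - m) ** 2 for x in v) / len(v))
        return [(x - m) / s for x in v]
    # dot product of the standardized objects, averaged over the n attributes
    return sum(a * b for a, b in zip(standardize(p), standardize(q))) / len(p)
```

cosine([3,2,0,5,0,0,0,2,0,0], [1,0,0,0,0,0,0,1,0,2]) reproduces the 0.315 of the worked example; perfectly anti-correlated objects give a correlation of −1.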
Scatter plots showing the similarity from –1 to 1.

Density
● Density-based clustering requires a notion of density
● Examples:
  – Euclidean density: number of points per unit volume
  – Probability density
  – Graph-based density
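A minimal sketch of Euclidean density (points per unit volume) for 2-D points, using a grid of square cells with a chosen side length; the function name and cell size are illustrative:

```python
from collections import Counter

def cell_density(points, side):
    # map each 2-D point to its grid cell and count points per cell
    return Counter((int(x // side), int(y // side)) for x, y in points)
```

For example, cell_density([(0.5, 0.5), (0.7, 0.2), (3.5, 3.5)], 1.0) reports 2 points in cell (0, 0) and 1 point in cell (3, 3).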
Euclidean Density – Cell-based
● The simplest approach is to divide the region into a number of rectangular cells of equal volume and define density as the number of points the cell contains

Euclidean Density – Center-based
● Euclidean density is the number of points within a specified radius of the point

Dr. Oner Celepcikay
CS 4319
Classification
Spring 2019

Machine Learning Methods – Classification
● Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
● Find a model for the class attribute as a function of the values of the other attributes.
● A test set is used to estimate the accuracy of the model.
● Goal: previously unseen records (test set) should be assigned a class as accurately as possible.

Machine Learning – Classification Example
(Figure: training set and test set tables with two categorical attributes, one continuous attribute, and a class label)
Model: Decision Tree

  Refund?
  ├─ Yes → NO
  └─ No → MarSt?
           ├─ Married → NO
           └─ Single, Divorced → TaxInc?
                                  ├─ < 80K → NO
                                  └─ > 80K → YES

● Splitting attributes: Refund, MarSt, TaxInc

Another Example of Decision Tree
● There could be more than one tree that fits the same data!

Apply Model to Test Data
● Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
● Start from the root of the tree: Refund = No takes the No branch; MarSt = Married reaches the leaf NO, so the model assigns Cheat = No.
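The walk through the tree can be written directly as nested tests (class labels as given on the slides; the 80K threshold is the split shown in the tree, with income in thousands):

```python
def classify(rec):
    # decision tree from the slides: Refund -> MarSt -> TaxInc
    if rec["Refund"] == "Yes":
        return "No"                                # leaf: NO
    if rec["MarSt"] == "Married":
        return "No"                                # leaf: NO
    # Single or Divorced branch: test taxable income
    return "Yes" if rec["TaxInc"] > 80 else "No"   # > 80K -> YES, otherwise NO
```

For the test record (Refund = No, Married, 80K), classify returns "No", matching the walk described above.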
Machine Learning – Classification Example
(Figure: training set → learning algorithm → model (induction); model → test set (deduction))

General Structure of Hunt's Algorithm
● Let Dt be the set of training records that reach a node t
● General procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  – If Dt is an empty set, then t is a leaf node labeled by the default class, yd
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
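A minimal sketch of the procedure above; the attribute test here is simply "next unused attribute", whereas a real implementation would search for the best split (e.g., by Gini):

```python
def hunt(records, attrs, default):
    # records: list of (attribute-dict, class-label) pairs
    if not records:                        # empty D_t: leaf with default class y_d
        return default
    labels = [y for _, y in records]
    if len(set(labels)) == 1:              # pure D_t: leaf labeled with that class
        return labels[0]
    if not attrs:                          # no tests left: majority class
        return max(set(labels), key=labels.count)
    a = attrs[0]                           # attribute test (no best-split search here)
    majority = max(set(labels), key=labels.count)
    branches = {v: hunt([r for r in records if r[0][a] == v], attrs[1:], majority)
                for v in {r[0][a] for r in records}}
    return (a, branches)

def predict(tree, x):
    while isinstance(tree, tuple):         # descend until a leaf label is reached
        attr, branches = tree
        tree = branches[x[attr]]
    return tree
```

On a toy training set where Refund = Yes never cheats and Refund = No always does, the recursion stops after one split with two pure leaves.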
● British Petroleum designed a decision tree for gas–oil separation for offshore oil platforms that replaced an earlier rule-based expert system.
● We will do a similar (but simpler) decision tree example towards the end of the semester.

Tree Induction
● Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
● Issues
  – Determine how to split the records: how to specify the attribute test condition? How to determine the best split?
  – Determine when to stop splitting
How to Determine the Best Split
● Before splitting: 10 records of class 0, 10 records of class 1
● Which test condition is the best?
● Greedy approach: nodes with a homogeneous class distribution are preferred
● Need a measure of node impurity:
  – Non-homogeneous: high degree of impurity
  – Homogeneous: low degree of impurity

Measures of Node Impurity
● Gini Index
● Entropy
● Misclassification error

How to Find the Best Split
(Figure: candidate splits on attributes A and B, producing child nodes N1/N2 and N3/N4)
● Before splitting: impurity M0. Splitting on A produces children with impurities M1 and M2 (combined: M12); splitting on B produces M3 and M4 (combined: M34)
● Gain = M0 − M12 vs. M0 − M34

Measure of Impurity: GINI
● Gini Index for a given node t:
  GINI(t) = 1 − Σ_j [p(j | t)]²
  (NOTE: p(j | t) is the relative frequency of class j at node t)
● Maximum (0.5) when records are equally distributed among all classes, implying least interesting information
● Minimum (0.0) when all records belong to one class, implying most interesting information
Examples for Computing GINI
  C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
    Gini = 1 – P(C1)² – P(C2)² = 1 – 0 – 1 = 0
  C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
    Gini = 1 – (1/6)² – (5/6)² = 0.278
  C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
    Gini = 1 – (2/6)² – (4/6)² = 0.444
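The node computations above, and the weighted "children" Gini used in the split example that follows, can be checked with a short sketch over raw class counts:

```python
def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2, from raw class counts at node t
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(child_counts):
    # weighted average of the children's Gini values
    n = sum(sum(c) for c in child_counts)
    return sum(sum(c) / n * gini(c) for c in child_counts)
```

gini([1, 5]) and gini([2, 4]) reproduce 0.278 and 0.444; gini_split([[4, 3], [2, 3]]) reproduces the 0.486 of the split-on-A example.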
Examples for Computing GINI
● Split on A: N1 receives (C1 = 4, C2 = 3), N2 receives (C1 = 2, C2 = 3)
  Gini(N1) = 1 – (4/7)² – (3/7)² = 0.4898
  Gini(N2) = 1 – (2/5)² – (3/5)² = 0.48
  Gini(Children) = 7/12 × 0.4898 + 5/12 × 0.48 = 0.486
● Split on B (exercise): N1 receives (C1 = 1, C2 = 4), N2 receives (C1 = 5, C2 = 2)
  Gini(N1) = 1 – (·/·)² – (·/·)² = ?
  Gini(N2) = 1 – (·/·)² – (·/·)² = ?
  Gini(Children) = ?

Splitting Criteria Based on Classification Error
● Classification error at a node t:
  Error(t) = 1 − max_i P(i | t)
● Measures the misclassification error made by a node
● Maximum (0.5) when records are equally distributed among all classes, implying least interesting information
● Minimum (0) when all records belong to one class, implying most interesting information
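The error measure above in code, computed from raw class counts at a node:

```python
def classification_error(counts):
    # Error(t) = 1 - max_i P(i|t)
    n = sum(counts)
    return 1 - max(c / n for c in counts)
```

For the node counts used on the next slide, classification_error([1, 5]) gives 1/6 and classification_error([2, 4]) gives 1/3.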
Splitting Criteria Based on Classification Error
  C1 = 0, C2 = 6:  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
    Error = 1 – max(0, 1) = 1 – 1 = 0
  C1 = 1, C2 = 5:  P(C1) = 1/6, P(C2) = 5/6
    Error = 1 – max(1/6, 5/6) = 1 – 5/6 = 1/6
  C1 = 2, C2 = 4:  P(C1) = 2/6, P(C2) = 4/6
    Error = 1 – max(2/6, 4/6) = 1 – 4/6 = 1/3

Tree Induction
● Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
● Issues
  – Determine how to split the records: how to specify the attribute test condition? How to determine the best split?
  – Determine when to stop splitting (Next class!) ANY IDEAS??

Classification Methods
● Decision Tree based Methods
● Rule-based Methods
● Memory based reasoning
● Neural Networks
● Naïve Bayes and Bayesian Belief Networks
● Support Vector Machines
Test Data:
  Refund   Marital Status   Taxable Income   Cheat
  No       Married          80K              ?
(Figures: candidate splits on Own Car?, Car Type?, and Student ID?, with class counts C0/C1 at each child node)

  Parent: C1 = 6, C2 = 6, Gini = 0.500
  Split A: N1 (C1 = 4, C2 = 3), N2 (C1 = 2, C2 = 3) → Gini = 0.486
  Split B: N1 (C1 = 1, C2 = 4), N2 (C1 = 5, C2 = 2) → Gini = ?

  Error(t) = 1 − max_i P(i | t)
  (Example node class counts: C1 = 1, C2 = 5; C1 = 0, C2 = 6; C1 = 2, C2 = 4)

Dr. Oner Celepcikay
CS 4319
Machine Learning, Week 6
Data Science Tool I – Classification, Part II
Tree Induction
● Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
● Issues
  – Determine how to split the records: how to specify the attribute test condition? How to determine the best split?
  – Determine when to stop splitting

Stopping Criteria for Tree Induction
● Stop expanding a node when all the records belong to the same class
● Stop expanding a node when all the records have similar attribute values
● Early termination (to be discussed later)
Practical Issues of Classification
● Underfitting and Overfitting
● Missing Values
● Costs of Classification

Underfitting and Overfitting
● Underfitting: when the model is too simple, both training and test errors are large

Overfitting due to Noise
● The decision boundary is distorted by the noise point
● Bats and whales are misclassified: non-mammals instead of mammals
Overfitting due to Noise
● Both humans and dolphins were misclassified as non-mammals because their Body Temperature, Gives Birth, and Four-legged values are identical to the mislabeled records in the training set
● Spiny anteaters represent an exceptional case (in the training set, every warm-blooded animal that does not give birth is a non-mammal)
● The decision tree perfectly fits the training data (training error = 0), but the error rate on the test data is 30%

Estimating Generalization Errors
● Re-substitution errors: error on the training set, e(t)
● Methods for estimating generalization errors:
  – Optimistic approach: e′(t) = e(t)
  – Pessimistic approach: add a penalty of 0.5 for each leaf (N: number of leaf nodes).
    For a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances):
    training error = 10/1000 = 1%; generalization error = (10 + 30 × 0.5)/1000 = 2.5%
  – Reduced error pruning (REP): uses a validation data set to estimate the generalization error

Occam's Razor
● Given two models with similar generalization errors, one should prefer the simpler model over the more complex model
● For complex models, there is a greater chance that the model was fitted accidentally by errors in the data
● Therefore, one should include model complexity when evaluating a model

How to Address Overfitting
● Pre-pruning (early stopping rule)
  – Stop the algorithm before it becomes a fully-grown tree
  – Typical stopping conditions for a node: stop if all instances belong to the same class; stop if all the attribute values are the same
  – More restrictive conditions: stop if the number of instances is less than some user-specified threshold; stop if the class distribution of the instances is independent of the available features (e.g., the current node does not improve impurity measures such as Gini or information gain)
● Post-pruning
  – Grow the decision tree to its entirety
  – Trim the nodes of the decision tree in a bottom-up fashion
  – If the generalization error improves after trimming, replace the sub-tree by a leaf node whose class label is determined from the majority class of instances in the sub-tree

Example of Post-Pruning
  Root node: Class = Yes: 20, Class = No: 10 → Error = 10/30
  Training error (before splitting) = 10/30
  Pessimistic error (before splitting) = (10 + 0.5)/30 = 10.5/30
  Leaf nodes after splitting: (Yes 8, No 4), (Yes 2, No 5), (Yes 6, No 1), (Yes 4, No 0)
  Training error (after splitting) = 7/30
  Pessimistic error (after splitting) = ?
  PRUNE OR DO NOT PRUNE?

Handling Missing Attribute Values
● Missing values affect decision tree construction in three different ways:
  – Affects how impurity measures are computed
  – Affects how to distribute an instance with a missing value to child nodes
  – Affects how a test instance with a missing value is classified

Model Evaluation
● Metrics for Performance Evaluation: how to evaluate the performance of a model?
● Methods for Performance Evaluation: how to obtain reliable estimates?
● Methods for Model Comparison: how to compare the relative performance among competing models?
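For the post-pruning example above (10 training errors at the unsplit root, 7 errors over 4 leaves after splitting, 30 records), the pessimistic estimate (0.5 penalty per leaf, as defined earlier) can be computed directly:

```python
def pessimistic_error(train_errors, n_leaves, n_records, penalty=0.5):
    # generalization estimate: (training errors + penalty per leaf) / records
    return (train_errors + penalty * n_leaves) / n_records

before = pessimistic_error(10, 1, 30)   # the unsplit node is a single leaf: 10.5/30
after = pessimistic_error(7, 4, 30)     # four leaves after splitting: 9/30
```

Since 9/30 < 10.5/30, this estimate favors keeping the split rather than pruning.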
Metrics for Performance Evaluation
● Focus on the predictive capability of a model
  – Rather than how fast it classifies or builds models, scalability, etc.
● Confusion Matrix (a: TP, true positive; b: FN, false negative; c: FP, false positive; d: TN, true negative):

                          PREDICTED CLASS
                          Class=Yes   Class=No
  ACTUAL    Class=Yes     a (TP)      b (FN)
  CLASS     Class=No      c (FP)      d (TN)

● Most widely-used metric: Accuracy

Limitation of Accuracy
● Consider a 2-class problem: number of Class 0 examples = 9990, number of Class 1 examples = 10
● If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
● Accuracy is misleading because the model does not detect any class 1 example

Cost Matrix
● C(i|j): cost of misclassifying a class j example as class i

                          PREDICTED CLASS
  C(i|j)                  Class=Yes    Class=No
  ACTUAL    Class=Yes     C(Yes|Yes)   C(No|Yes)
  CLASS     Class=No      C(Yes|No)    C(No|No)

Computing Cost of Classification
  Cost Matrix:            PREDICTED CLASS
  C(i|j)                  +      –
  ACTUAL        +         –1     100
  CLASS         –         1      0

  Model M1:               PREDICTED CLASS
                          +      –
  ACTUAL        +         150    40
  CLASS         –         60     250
  Accuracy = 80%, Cost = 3910

  Model M2:               PREDICTED CLASS
                          +      –
  ACTUAL        +         250    45
  CLASS         –         5      200
  Accuracy = 90%, Cost = 4255

Cost vs Accuracy
  Count:                  PREDICTED CLASS
                          Class=Yes   Class=No
  ACTUAL   Class=Yes      a           b
  CLASS    Class=No       c           d

  Cost:                   PREDICTED CLASS
                          Class=Yes   Class=No
  ACTUAL   Class=Yes      p           q
  CLASS    Class=No       q           p
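The accuracy and cost numbers for M1 and M2 above can be reproduced with a short sketch; the confusion matrices are keyed by (actual, predicted) pairs:

```python
def accuracy(cm):
    correct = cm[("+", "+")] + cm[("-", "-")]
    return correct / sum(cm.values())

def total_cost(cm, cost):
    # sum over cells: count times misclassification cost C(i|j)
    return sum(n * cost[cell] for cell, n in cm.items())

cost = {("+", "+"): -1, ("+", "-"): 100, ("-", "+"): 1, ("-", "-"): 0}
m1 = {("+", "+"): 150, ("+", "-"): 40, ("-", "+"): 60, ("-", "-"): 250}
m2 = {("+", "+"): 250, ("+", "-"): 45, ("-", "+"): 5, ("-", "-"): 200}
```

M1 scores 80% accuracy at cost 3910; M2 scores 90% accuracy at cost 4255, so the more accurate model is the more costly one here.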
Methods for Performance Evaluation
● How to obtain a reliable estimate of performance?
● Performance of a model may depend on other factors besides the learning algorithm:
  – Class distribution
  – Cost of misclassification
  – Size of training and test sets

Methods of Estimation
● Holdout: reserve 2/3 for training and 1/3 for testing
● Random subsampling: repeated holdout
● Cross validation: partition data into k disjoint subsets
  – k-fold: train on k−1 partitions, test on the remaining one
  – Leave-one-out: k = n
● Stratified sampling: oversampling vs. undersampling
● Bootstrap: sampling with replacement
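A minimal sketch of k-fold partitioning into k disjoint subsets (indices only; the function name is illustrative):

```python
import random

def kfold_indices(n, k, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)        # shuffle once, then deal into k folds
    folds = [idx[i::k] for i in range(k)]   # k disjoint subsets
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test                   # train on k-1 partitions, test on one
```

Every record appears in exactly one test fold, so each record is used for testing exactly once across the k rounds.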
  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Dr. Oner Celepcikay
ITS 632
Data Mining Algorithms: Clustering Part I

Clustering Analysis
● Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
● Inter-cluster distances are maximized; intra-cluster distances are minimized
Clustering Analysis
● Supervised Learning vs. Unsupervised Learning (clustering is unsupervised learning)
Notion of a Cluster Can Be Ambiguous
(Figure: the same set of points grouped into different numbers of clusters, e.g., six clusters)
Types of Clustering
● Partitional Clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
● Hierarchical Clustering: a set of nested clusters organized as a hierarchical tree

Partitional Clustering
(Figure: original points and a partitional clustering of them)

Hierarchical Clustering
Types of Clustering: Objective Function
● Clusters defined by an objective function: find clusters that minimize or maximize an objective function
● Enumerate all possible ways of dividing the points into clusters and evaluate the 'goodness' of each potential set of clusters by using the given objective function
● Parameters for the model are determined from the data
Types of Clustering: Objective Function
● Map the clustering problem to a different domain: the proximity matrix defines a weighted graph, where the nodes are the points being clustered and the weighted edges represent the proximities between points
● Clustering is equivalent to breaking the graph into connected components, one for each cluster
● Want to minimize the edge weight between clusters and maximize the edge weight within clusters
Clustering Algorithms
● K-means and its variants
● Hierarchical clustering
● Density-based clustering
K-means Clustering
● Partitional clustering approach
● Each cluster is associated with a centroid (center point)
● Each point is assigned to the cluster with the closest centroid
● Number of clusters, K, must be specified
● The basic algorithm is very simple
K-means Clustering
● Initial centroids are often chosen randomly; the clusters produced vary from one run to another
● The centroid is the mean of the points in the cluster
● 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
● K-means will converge for the common similarity measures mentioned above; most of the convergence happens in the first few iterations
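The basic algorithm sketched in plain Python: random initial centroids, assignment of each point to the closest centroid by squared Euclidean distance, centroids recomputed as cluster means; a fixed iteration budget stands in for a proper convergence test:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # initial centroids chosen randomly
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                     # assign each point to closest centroid
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        for i, c in enumerate(clusters):     # centroid = mean of the cluster's points
            if c:
                centroids[i] = tuple(sum(v) / len(c) for v in zip(*c))
    return centroids, clusters

centroids, clusters = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)
```

On the four well-separated points above, the algorithm settles into the two obvious groups within the first few iterations, matching the convergence remark on the slide.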
K-means Clustering in Action
● K-Means animation: http://tech.nitoyon.com/en/blog/2013/11/07/k-means/
Importance of Choosing Initial Centroids
● Multiple runs: helps, but probability is not on your side
● Sample and use hierarchical clustering to find K centroids
● Select more than K initial centroids and then select among these initial centroids
  – Select the most widely separated
● Postprocessing