U N I V E R S I T Y O F S O U T H F L O R I D A // 1
Segmentation: Clustering and
Classification
Dr. S. Shivendu
U N I V E R S I T Y O F S O U T H F L O R I D A // 2
Objectives
Overview of Segmentation
01. Perform clustering and classification.
02. Perform structural equation modeling.
03. Perform programming in SAS to do factor analysis.
U N I V E R S I T Y O F S O U T H F L O R I D A // 3
Agenda
Overview of Statistical Concepts
Segmentation: Clustering and Classification
Structural Equation Modeling
Factor Analysis in SAS
U N I V E R S I T Y O F S O U T H F L O R I D A // 4
Course Textbooks
U N I V E R S I T Y O F S O U T H F L O R I D A // 5
Segmentation
• Classifying or dividing a set of objects, consumers, or people into different groups
• Market segmentation refers to dividing consumers into distinct groups based on some shared characteristics
• Statistical classification refers to using statistical tools to divide the population into subgroups
• Machine learning methods are also used for dividing the population into subgroups
• We will cover two approaches today
• Clustering
• Machine learning method – Decision Trees – for classification (very briefly)
U N I V E R S I T Y O F S O U T H F L O R I D A // 6
Unsupervised learning: Clustering
• Clustering is the organization of unlabeled data into similarity groups called clusters.
• Therefore, a cluster is a collection of data items that are “similar” to one another and “dissimilar” to data items in other clusters.
U N I V E R S I T Y O F S O U T H F L O R I D A // 7
Initial applications of clustering
U N I V E R S I T Y O F S O U T H F L O R I D A // 8
More recent example: computer vision
U N I V E R S I T Y O F S O U T H F L O R I D A // 9
Clustering: Basic idea
Group together similar instances. But which grouping: a, b, or c?
[Figure: three alternative groupings of the same points, labeled a, b, and c.]
U N I V E R S I T Y O F S O U T H F L O R I D A // 10
Requirements for clustering
U N I V E R S I T Y O F S O U T H F L O R I D A // 11
Measure of distance or ‘dissimilarity’
U N I V E R S I T Y O F S O U T H F L O R I D A // 12
Cluster evaluation
• Intra-cluster cohesion (compactness):
• Cohesion measures how near the data points in a cluster are to the cluster centroid.
• Sum of squared error (SSE) is a commonly used measure.
• Inter-cluster separation (isolation):
• Separation means that different cluster centroids should be far away from one another.
• In most applications, the key is still expert judgment
U N I V E R S I T Y O F S O U T H F L O R I D A // 13
How many clusters?
U N I V E R S I T Y O F S O U T H F L O R I D A // 14
Clustering methods
U N I V E R S I T Y O F S O U T H F L O R I D A // 15
Clustering methods (contd.)
U N I V E R S I T Y O F S O U T H F L O R I D A // 16
Clustering: k-means
U N I V E R S I T Y O F S O U T H F L O R I D A // 17
K-means clustering
U N I V E R S I T Y O F S O U T H F L O R I D A // 18
K-means algorithm
U N I V E R S I T Y O F S O U T H F L O R I D A // 19
K-means convergence criterion
U N I V E R S I T Y O F S O U T H F L O R I D A // 20
K-means clustering: Step 1
U N I V E R S I T Y O F S O U T H F L O R I D A // 21
K-means clustering: Step 2
U N I V E R S I T Y O F S O U T H F L O R I D A // 22
k-means clustering example
U N I V E R S I T Y O F S O U T H F L O R I D A // 23
k-means clustering example
U N I V E R S I T Y O F S O U T H F L O R I D A // 24
k-means clustering example
U N I V E R S I T Y O F S O U T H F L O R I D A // 25
K-means: Strength
• Simple: easy to understand and to implement
• Efficient: Time complexity: O(tkn),
• where n is the number of data points,
• k is the number of clusters, and
• t is the number of iterations.
• Since both k and t are small, k-means is considered a linear algorithm.
• K-means is the most popular clustering algorithm.
• Note that it terminates at a local optimum when SSE is used as the objective; the global optimum is hard to find due to the complexity of the problem.
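• As a hedged illustration, the short SAS sketch below shows how a k-means-style segmentation is typically run with PROC FASTCLUS; the dataset CUSTOMERS and the variables INCOME and SPEND are hypothetical, and k is set to 3 purely for illustration.

/* Sketch only: k-means-style clustering in SAS. CUSTOMERS, INCOME, and SPEND are hypothetical. */
proc stdize data=customers method=std out=customers_std;
   var income spend;                 /* standardize so no variable dominates the distance */
run;

proc fastclus data=customers_std maxclusters=3 maxiter=100 out=segments;
   var income spend;                 /* k = 3; OUT= adds CLUSTER and DISTANCE to each row */
run;

proc means data=segments mean n;
   class cluster;                    /* profile each segment on the clustering variables */
   var income spend;
run;

• Standardizing first is a common design choice because k-means is distance-based and therefore scale-sensitive.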
U N I V E R S I T Y O F S O U T H F L O R I D A // 26
K-means: Weaknesses
• The algorithm is only applicable if the mean is defined.
• For categorical data, use k-modes, where the centroid is represented by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers
• Outliers are data points that are very far away from other data points.
• Outliers could be errors in the data recording or some special data points with very
different values
U N I V E R S I T Y O F S O U T H F L O R I D A // 27
Outliers
U N I V E R S I T Y O F S O U T H F L O R I D A // 28
Dealing with outliers
• Remove some data points that are much further away from the centroids than other
data points
• To be safe, we may want to monitor these possible outliers over a few
iterations and then decide to remove them.
• Perform random sampling: by choosing a small subset of the data points, the chance
of selecting an outlier is much smaller
• Assign the rest of the data points to the clusters by distance or similarity
comparison, or classification
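• A hedged way to operationalize this in SAS, assuming the hypothetical SEGMENTS output from the earlier PROC FASTCLUS sketch (which carries a DISTANCE-to-centroid variable) and an illustrative three-standard-deviation cutoff:

/* Sketch only: flag points that are unusually far from their assigned centroid. */
proc means data=segments noprint nway;
   class cluster;
   var distance;
   output out=diststats mean=dist_mean std=dist_std;
run;

proc sql;
   create table flagged as
   select s.*, (s.distance > d.dist_mean + 3*d.dist_std) as possible_outlier
   from segments as s, diststats as d
   where s.cluster = d.cluster;      /* join each point to its own cluster's distance statistics */
quit;

• Rather than deleting flagged points immediately, one can monitor them over a few iterations, as suggested above, before deciding to remove them.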
U N I V E R S I T Y O F S O U T H F L O R I D A // 29
Special data structures: Limitation
• The k-means algorithm is not suitable for discovering clusters that are not hyper-
ellipsoids (or hyper-spheres)
U N I V E R S I T Y O F S O U T H F L O R I D A // 30
Example: k-means for segmentation
U N I V E R S I T Y O F S O U T H F L O R I D A // 31
Example: k-means for segmentation
U N I V E R S I T Y O F S O U T H F L O R I D A // 32
Example: k-means for segmentation
U N I V E R S I T Y O F S O U T H F L O R I D A // 33
K-means: Summary
• Despite weaknesses, k-means is still the most popular algorithm due to its simplicity and
efficiency
• No clear evidence that any other clustering algorithm performs better in general
• Comparing different clustering algorithms is a difficult task. No one knows the correct clusters!
U N I V E R S I T Y O F S O U T H F L O R I D A // 34
Clustering method: Hierarchical
U N I V E R S I T Y O F S O U T H F L O R I D A // 35
Hierarchical clustering
U N I V E R S I T Y O F S O U T H F L O R I D A // 36
Example: Biological taxonomy
U N I V E R S I T Y O F S O U T H F L O R I D A // 37
Types of hierarchical clustering
• Top-down clustering (or divisive clustering)
• Starts with all data points in one cluster, the root, then
• splits the root into a set of child clusters. Each child cluster is recursively divided
further
• stops when only singleton clusters of individual data points remain, i.e., each
cluster with only a single point
• Bottom-up clustering (or agglomerative clustering)
• The dendrogram (defined on a later slide) is built from the bottom level by
• merging the most similar (or nearest) pair of clusters
• stopping when all the data points are merged into a single cluster (i.e., the root
cluster).
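• A hedged SAS sketch of the agglomerative (bottom-up) approach, reusing the hypothetical standardized CUSTOMERS_STD data and contrasting the two linkage criteria defined on later slides:

/* Sketch only: agglomerative clustering with single and complete linkage. */
proc cluster data=customers_std method=single outtree=tree_single noprint;
   var income spend;                 /* nearest-neighbor (single) linkage */
run;

proc cluster data=customers_std method=complete outtree=tree_complete noprint;
   var income spend;                 /* farthest-neighbor (complete) linkage */
run;

proc tree data=tree_single nclusters=3 out=seg_single noprint;
run;                                 /* cut each dendrogram into 3 clusters */

proc tree data=tree_complete nclusters=3 out=seg_complete noprint;
run;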
U N I V E R S I T Y O F S O U T H F L O R I D A // 38
What is a Dendrogram?
U N I V E R S I T Y O F S O U T H F L O R I D A // 39
Divisive hierarchical clustering
U N I V E R S I T Y O F S O U T H F L O R I D A // 40
Agglomerative hierarchical clustering
U N I V E R S I T Y O F S O U T H F L O R I D A // 41
Single linkage or Nearest neighbor
U N I V E R S I T Y O F S O U T H F L O R I D A // 42
Complete linkage or Farthest neighbor
U N I V E R S I T Y O F S O U T H F L O R I D A // 43
Divisive vs. Agglomerative
U N I V E R S I T Y O F S O U T H F L O R I D A // 44
In summary
• Clustering has a long history and is still an active research area
• There are a huge number of clustering algorithms, among them:
density-based algorithms, sub-space clustering, scale-up methods,
neural-network-based methods, fuzzy clustering, co-clustering
• Clustering is hard to evaluate, but very useful in practice
• Clustering is highly application dependent (and to some extent subjective)
U N I V E R S I T Y O F S O U T H F L O R I D A // 45
Machine learning and classification
• Machine learning: studies how to automatically learn to make accurate
predictions based on past observations
• Classification problem: classify observations into a given set of categories
U N I V E R S I T Y O F S O U T H F L O R I D A // 46
Examples of classification problem
• Text categorization (e.g., spam filtering)
• Fraud detection
• Optical character recognition
• Machine vision (e.g., face detection)
• Natural-language processing (e.g., spoken-language understanding)
• Market segmentation (e.g., predict whether a customer will respond to a promotion)
• Bioinformatics (e.g., classify proteins according to their function)
U N I V E R S I T Y O F S O U T H F L O R I D A // 47
Classification algorithms in machine learning
• Decision trees
• Boosting
• Support vector machine
• Neural networks
• Nearest neighbor algorithm
• Naive Bayes
• Bagging
• Random forest
• We will discuss only decision trees
U N I V E R S I T Y O F S O U T H F L O R I D A // 48
Decision trees: An example (For Reference Only: Slides
48 to 62)
U N I V E R S I T Y O F S O U T H F L O R I D A // 49
A decision tree classifier
U N I V E R S I T Y O F S O U T H F L O R I D A // 50
How to build decision trees
• Choose a rule to split on
• Divide the data into disjoint sets following the splitting rule
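• As a hedged sketch of these steps in SAS, PROC HPSPLIT (available in recent SAS/STAT releases) chooses the splitting rules and divides the data; the dataset CUSTOMERS and the variables RESPONDED, AGE, INCOME, and REGION are hypothetical.

/* Sketch only: grow a classification tree for a binary target. */
proc hpsplit data=customers maxdepth=5;
   class responded region;               /* categorical target and categorical predictor */
   model responded = age income region;  /* candidate splitting variables */
   grow entropy;                         /* purity criterion used to choose each split */
   prune costcomplexity;                 /* trim the grown tree to control its size */
run;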
U N I V E R S I T Y O F S O U T H F L O R I D A // 51
Building decision trees
U N I V E R S I T Y O F S O U T H F L O R I D A // 52
How to choose splitting rule?
U N I V E R S I T Y O F S O U T H F L O R I D A // 53
How to measure purity?
U N I V E R S I T Y O F S O U T H F L O R I D A // 54
Different types of error rates
• Training error = fraction of training examples misclassified
• Test error = fraction of test examples misclassified
• Generalization error = probability of misclassifying a new random example
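• Continuing the hedged PROC HPSPLIT sketch above, one way to approximate test error is to hold out part of the data with a PARTITION statement; the misclassification rate on the held-out fraction estimates how the tree generalizes.

/* Sketch only: 70% training / 30% validation split for error estimation. */
proc hpsplit data=customers seed=123;
   class responded region;
   model responded = age income region;
   partition fraction(validate=0.3);     /* validation error approximates test error */
run;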
U N I V E R S I T Y O F S O U T H F L O R I D A // 55
A possible classifier
U N I V E R S I T Y O F S O U T H F L O R I D A // 56
Another classifier
U N I V E R S I T Y O F S O U T H F L O R I D A // 57
Test size vs accuracy
U N I V E R S I T Y O F S O U T H F L O R I D A // 58
Overfitting: an example
U N I V E R S I T Y O F S O U T H F L O R I D A // 59
Building a good classifier
U N I V E R S I T Y O F S O U T H F L O R I D A // 60
Example
U N I V E R S I T Y O F S O U T H F L O R I D A // 61
Good and bad classifiers
U N I V E R S I T Y O F S O U T H F L O R I D A // 62
Controlling tree size
U N I V E R S I T Y O F S O U T H F L O R I D A // 63
Good and bad of decision trees
• Very fast to train and evaluate
• Relatively easy to interpret
• BUT: Accuracy often not state-of-the-art
U N I V E R S I T Y O F S O U T H F L O R I D A // 64
Practical considerations
• There are many classification methods in machine learning
• Each of them has advantages and disadvantages and is suited to a specific domain
• Practical considerations
• We want the training data to be like the test data
• Use your knowledge of the problem to know where to get training data, and what to expect the test data to be like
U N I V E R S I T Y O F S O U T H F L O R I D A // 65
Structural Equation Modeling
• Structural equation modeling (SEM) is
• a family of statistical methods that allow complex relationships between one or more independent variables and one or more dependent variables to be modeled.
• Though there are many ways to describe SEM, it is most commonly thought of
• as a hybrid between some form of analysis of variance (ANOVA)/regression and
some form of factor analysis.
• In general, it can be remarked that SEM allows one to
• perform some type of multilevel regression/ANOVA on factors.
• One should therefore be quite familiar with
• univariate and multivariate regression/ANOVA
• as well as the basics of factor analysis
• to implement SEM for your data.
U N I V E R S I T Y O F S O U T H F L O R I D A // 66
Some preliminary terminology
• Variables that are not influenced by any other variables in a model are called exogenous variables.
• As an example, suppose we have two factors that cause changes in GPA: hours studying per week and IQ. Suppose there is no causal relationship between hours studying and IQ. Then both IQ and hours studying would be exogenous variables in the model.
• Variables that are influenced by other variables in a model are called endogenous variables. GPA would be an endogenous variable in the previous example in (a).
• A variable that is directly observed and measured is called a manifest variable (it is also called an indicator variable in some circles).
• In the example in (a), all variables can be directly observed and thus qualify as manifest variables. A structural equation model that examines only manifest variables has a special name: path analysis.
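• A hedged SAS sketch of the GPA example as a path analysis: with only manifest variables, the PROC CALIS model needs just the two causal paths (the dataset GRADES and its variable names are hypothetical).

/* Sketch only: path analysis with manifest variables (IQ and HOURS exogenous, GPA endogenous). */
proc calis data=grades;
   path
      iq    -> gpa,     /* direct effect of IQ on GPA          */
      hours -> gpa;     /* direct effect of study hours on GPA */
run;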
U N I V E R S I T Y O F S O U T H F L O R I D A // 67
Some preliminary terminology (contd.)
• A variable that is not directly measured is a latent variable.
• The “factors” in a factor analysis are latent variables.
• For example, suppose we were additionally interested in the impact of motivation on GPA.
Motivation, as it is an internal, non-observable state, is indirectly assessed by a student’s
response on a questionnaire, and thus it is a latent variable.
• Latent variables increase the complexity of a structural equation model because one needs
to take into account all of the questionnaire items and measured responses that are used to
quantify the “factor” or latent variable.
• In this instance, each item on the questionnaire would be a single variable that would either
be significantly or insignificantly involved in a linear combination of variables that influence
the variation in the latent factor of motivation
U N I V E R S I T Y O F S O U T H F L O R I D A // 68
Some preliminary terminology
• For the purposes of SEM, specifically, moderation refers to a situation that includes three or
more variables, such that the presence of one of those variables changes the relationship
between the other two.
• In other words, moderation exists when the association between two variables is not the
same at all levels of a third variable.
• One way to think of moderation is when you observe an interaction between two
variables in an ANOVA.
• For example, stress and psychological adjustment may differ at different levels of social
support (i.e., this is the definition of an interaction).
• In other words, stress may adversely affect adjustment more under conditions of low social support than under conditions of high social support. This would imply a two-way interaction between stress and social support if an ANOVA were to be performed.
• Figure 1 shows a conceptual diagram of moderation. This diagram shows that there are
three direct effects that are hypothesized to cause changes in psychological adjustment
– a main effect of stress, a main effect of social support, and an interaction effect of
stress and social support.
U N I V E R S I T Y O F S O U T H F L O R I D A // 69
What is moderation?
Figure 1. A Diagrammed Example of Moderation. Note that each individual effect of stress, social support, and the interaction of stress and social support can be separated and is said to be related to psychological adjustment. Also note that there are no causal connections between stress and social support.
• Figure 1 shows a conceptual diagram of moderation.
• This diagram shows that there are three direct effects that are hypothesized to cause changes in psychological adjustment – a main effect of stress, a main effect of social support, and an interaction effect of stress and social support.
[Diagram: Stress, Social Support, and the Stress × Social Support interaction each point directly to Psychological Adjustment.]
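• A hedged sketch of how this moderation model could be tested in SAS as an ANOVA/regression with an interaction term; the dataset SURVEY and the variables STRESS, SOCSUPPORT, and ADJUST are hypothetical.

/* Sketch only: moderation tested as a stress-by-social-support interaction. */
proc glm data=survey;
   model adjust = stress socsupport stress*socsupport;   /* two main effects plus the interaction */
run;
quit;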
U N I V E R S I T Y O F S O U T H F L O R I D A // 70
Some more terminology: Mediation
• In SEM, mediation refers to a situation
• that includes three or more variables, such that there is a causal process
between all three variables.
• Note that this is distinct from moderation.
• In the previous example in (e), we can say there are three separate things
in the model that cause a change in psychological adjustment: stress,
social support, and the combined effect of stress and social support that is
not accounted for by each individual variable.
• Mediation describes a much different relationship that is generally more
complex.
• In a mediation relationship, there is a direct effect between an independent
variable and a dependent variable.
• There are also indirect effects between an independent variable and a mediator
variable, and between a mediator variable and a dependent variable.
U N I V E R S I T Y O F S O U T H F L O R I D A // 71
Mediation (contd.)
• The example in Figure 1 can be re-specified as a mediating process, as shown in Figure 2.
• Figure 2. A Diagram of a Mediation Model. Including indirect effects in a mediation model may change the direct effect of a single independent variable on a dependent variable.
[Diagram: Stress has a (revised) direct effect on Psychological Adjustment; indirect effects run from Stress to Social Support and from Social Support to Psychological Adjustment.]
U N I V E R S I T Y O F S O U T H F L O R I D A // 72
Mediation effect
• The main difference from the moderation model is that we now allow causal relationships between stress and social support, and between social support and psychological adjustment, to be expressed.
• Imagine that social support was not included in the model – we just wanted to see the direct effect of stress on psychological adjustment.
• We would get a measure of the direct effect by using regression or ANOVA. When we include
social support as a mediator, that direct effect will change as a result of decomposing the
causal process into indirect effects of stress on social support and social support on
psychological adjustment.
• The degree to which the direct effect changes as a result of including the mediating variable
of social support is referred to as the mediational effect.
• Testing for mediation involves running a series of regression analyses for all of the causal
pathways and some method of estimating a change in direct effect. This technique is
actually involved in structural equation models that include mediator variables and will be
discussed in the next section of this document.
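• A hedged sketch of the Figure 2 mediation model in PROC CALIS, reusing the hypothetical SURVEY variables; the EFFPART statement requests the decomposition of the total effect of stress into its direct and indirect parts, which is one way to quantify the mediational effect described above.

/* Sketch only: mediation model with social support as the mediator. */
proc calis data=survey;
   path
      stress     -> socsupport,   /* indirect path: stress -> social support         */
      socsupport -> adjust,       /* indirect path: social support -> adjustment     */
      stress     -> adjust;       /* (revised) direct effect of stress on adjustment */
   effpart stress -> adjust;      /* report total, direct, and indirect effects      */
run;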
U N I V E R S I T Y O F S O U T H F L O R I D A // 73
Moderation and Mediation
• In many respects moderation and mediational models are the foundation of structural equation
modeling.
• In fact, they can be considered as simple structural equation models themselves.
• Therefore, it is very important to understand how to analyze such models to understand more
complex structural equation models that include latent variables.
U N I V E R S I T Y O F S O U T H F L O R I D A // 74
Covariance and Correlation
Covariance and correlation are the building blocks of how the data will be represented in the model specification that implements structural equation modeling.
One should know how to obtain a correlation or covariance matrix using PROC CORR in SAS.
In practice, the covariance matrix serves as the dataset to be analyzed.
In the context of SEM, covariances and correlations between variables are essential because
they allow one to include a relationship between two variables that is not necessarily causal.
In practice, most structural equation models contain both causal and non-causal relationships.
Obtaining covariance estimates between variables allows one to better estimate direct and
indirect effects with other variables, particularly in complex models with many parameters to
be estimated.
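• For reference, a minimal hedged example of obtaining such a covariance matrix in SAS (hypothetical SURVEY dataset again); the COV option requests covariances and OUTP= writes them to a dataset that can later serve as input to an SEM procedure.

/* Sketch only: compute and save a covariance matrix. */
proc corr data=survey cov outp=covmat noprint;
   var stress socsupport adjust;
run;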
U N I V E R S I T Y O F S O U T H F L O R I D A // 75
Structural model
• A structural model is a part of the entire structural equation model diagram that one
will complete for every model one proposes.
• It is used to relate all of the variables (both latent and manifest) one will need to
account for in the model.
• There are a few important rules to follow when creating a structural model.
U N I V E R S I T Y O F S O U T H F L O R I D A // 76
Measurement model
• A measurement model is a part of the entire structural equation model diagram that
one will complete for every proposed model.
• It is essential if there are latent variables in the model.
• This is the part of the diagram that is analogous to factor analysis:
• One needs to include all individual items, variables, or observations that “load” onto the latent variable, their relationships, variances, and errors.
• There are a few important rules to follow when creating the measurement model that we will discuss later in this document.
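• A hedged sketch of how a measurement model and a structural model are combined in PROC CALIS, extending the earlier GPA example with a latent Motivation factor measured by three hypothetical questionnaire items Q1-Q3; one loading is fixed to 1 to set the latent variable's scale.

/* Sketch only: full SEM = measurement model (latent factor and its items)
   plus structural model (paths among the factor and the manifest variables). */
proc calis data=grades;
   path
      Motivation -> q1   = 1.0,   /* measurement model: one loading fixed for identification */
      Motivation -> q2 q3,        /* remaining loadings are estimated                         */
      iq         -> gpa,          /* structural model: manifest predictors of GPA             */
      hours      -> gpa,
      Motivation -> gpa;          /* latent predictor of GPA                                  */
run;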
U N I V E R S I T Y O F S O U T H F L O R I D A // 77
Finally
• Together, the structural model and the measurement model form the entire structural
equation model.
• This model includes everything that has been measured, observed, or otherwise
manipulated in the set of variables examined.
• A recursive structural equation model is a model in which causation is directed in one
single direction.
• A nonrecursive structural equation model has causation that flows in both directions in some parts of the model.
U N I V E R S I T Y O F S O U T H F L O R I D A // 78
Research questions suitable for SEM
• SEM can conceptually be used to answer any research question involving the indirect or
direct observation of one or more independent variables or one or more dependent
variables.
• However, the primary goal of SEM is to determine and validate a proposed causal process and/or model.
• Therefore, SEM is a confirmatory technique.
• Like any other test or model, we have a sample and want to say something about the population from which the sample is drawn. We have a covariance matrix to serve as our dataset, which is based on the sample of collected measurements.
• The empirical question of SEM is therefore whether the proposed model produces a
population covariance matrix that is consistent with the sample covariance matrix.
• Because one must specify a priori a model that will undergo validation testing, there are
many questions SEM can answer.
U N I V E R S I T Y O F S O U T H F L O R I D A // 79
Limitations and Assumptions of SEM
• Because SEM is a confirmatory technique, one must plan accordingly.
• One must specify a full model a priori and test that model based on the sample and variables
included in the measurements.
• One must know the number of parameters one needs to estimate – including covariances, path coefficients, and variances.
• One must know all the relationships one wants to specify in the model. Then, and only then, can one begin analyses.
• Because SEM can model complex relationships among multivariate data, sample size is an important (but unfortunately underemphasized) issue.
• Two popular rules of thumb are that one needs more than 200 observations, or at least 50 more than 8 times the number of variables in the model.
• A larger sample size is always desired for SEM.
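• As a hedged illustration of the second rule of thumb: a model with 15 variables would call for at least 8 × 15 + 50 = 170 observations, so in that case the 200-observation guideline would be the stricter of the two.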
U N I V E R S I T Y O F S O U T H F L O R I D A // 80
Limitations and assumptions (contd.)
• Like other multivariate statistical methodologies, most of the estimation
techniques used in SEM require multivariate normality.
• Data need to be examined for univariate and multivariate outliers.
Transformations on the variables can be made. However, there are some
estimation methods that do not require normality.
• SEM techniques only look at first-order (linear) relationships between variables.
• Linear relationships can be explored by creating bivariate scatterplots for
all of your variables. Power transformations can be made if a relationship
between two variables seems quadratic.
U N I V E R S I T Y O F S O U T H F L O R I D A // 81
Limitations and assumptions (contd.)
• Multicollinearity among the IVs for manifest variables can be an issue.
• Most programs will inspect the determinant of a section of the covariance matrix, or of the whole covariance matrix.
• A very small determinant may be indicative of extreme multicollinearity.
• The residuals of the covariances (not residual scores) need to be small and centered
about zero.
• Some goodness of fit tests (like the Lagrange Multiplier test) remain robust
against highly deviated residuals or non-normal residuals.
U N I V E R S I T Y O F S O U T H F L O R I D A // 82
Preparing data for analysis
• Assuming one has checked the model assumptions, dealt with missing data, and imported the data into SAS, one should obtain a covariance matrix for the dataset.
• In SAS, you can run PROC CORR and create an output file that contains the covariance matrix.
• Alternatively, you can manually type in the covariance matrix or import it from a spreadsheet.
• One must specify in the data declaration that the dataset is a covariance matrix; in SAS, for example, your code would begin with data name(type=COV);
• Effort must be made to ensure the decimals of the entries are aligned.
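• A hedged sketch of typing a covariance matrix in by hand as a TYPE=COV dataset; the variable names match the earlier hypothetical SURVEY example and the numeric values are purely illustrative placeholders, not real data.

/* Sketch only: hand-entered covariance matrix; _TYPE_ and _NAME_ are required,
   and the numbers below are made-up placeholders. */
data covmat_manual(type=cov);
   length _type_ $ 8 _name_ $ 12;
   input _type_ $ _name_ $ stress socsupport adjust;
   datalines;
cov  stress      1.20  0.45  0.60
cov  socsupport  0.45  0.98  0.52
cov  adjust      0.60  0.52  1.10
n    .            200   200   200
;
run;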
U N I V E R S I T Y O F S O U T H F L O R I D A // 83
Developing model hypothesis and specifications
• Often, the most difficult part of SEM is correctly identifying the model.
• One must know exactly the number of latent and manifest variables one is including in the model, the number of variances and covariances to be calculated, and the number of parameters one is estimating.
U N I V E R S I T Y O F S O U T H F L O R I D A // 84
Building model
• The most important part of SEM analysis is the causal model. The following basic, general rules are used:
• Rule 1. Latent variables/factors are represented with circles and measured/manifest variables
are represented with squares.
• Rule 2. Lines with an arrow in one direction show a hypothesized direct relationship between
the two variables. It should originate at the causal variable and point to the variable that is
caused. Absence of a line indicates there is no causal relationship between the variables.
U N I V E R S I T Y O F S O U T H F L O R I D A // 85
Building model (contd.)
• Rule 3. Lines with an arrow in both directions should be curved and this demonstrates a bi-
directional relationship (i.e., a covariance).
• Rule 3a. Covariance arrows should only be allowed for exogenous variables.
U N I V E R S I T Y O F S O U T H F L O R I D A // 86
Building model (contd.)
• Rule 4. For every endogenous variable, a residual term should be added in the model.
Generally, a residual term is a circle with the letter E written in it, which stands for error.
• Rule 4a. For latent variables that are also endogenous, a residual term is not called error in
the lingo of SEM. It is called a disturbance, and therefore the “error term” here would be a
circle with a D written in it, standing for disturbance.
U N I V E R S I T Y O F S O U T H F L O R I D A // 87
SEM
• SEM is a useful method for building causal-relationship models
• It is the natural next step after factor analysis (EFA and CFA).
• The reading posted on Canvas is very useful for practicing and understanding the details of PROC CALIS.
U N I V E R S I T Y O F S O U T H F L O R I D A // 88
You have reached the end
of the presentation.
More Related Content

Similar to Segmentation: Clustering and Classification

CART Training 1999
CART Training 1999CART Training 1999
CART Training 1999
Salford Systems
 
4. Formulating research problems
4. Formulating research problems4. Formulating research problems
4. Formulating research problems
Razif Shahril
 
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Sherri Gunder
 
Read first few slides cluster analysis
Read first few slides cluster analysisRead first few slides cluster analysis
Read first few slides cluster analysis
Kritika Jain
 
Clusteranalysis 121206234137-phpapp01
Clusteranalysis 121206234137-phpapp01Clusteranalysis 121206234137-phpapp01
Clusteranalysis 121206234137-phpapp01
deepti gupta
 
Clusteranalysis
Clusteranalysis Clusteranalysis
Clusteranalysis
deepti gupta
 
Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network Analysis
David Gleich
 
5. Non parametric analysis
5. Non parametric analysis5. Non parametric analysis
5. Non parametric analysis
Razif Shahril
 
Has modelling killed randomisation inference frankfurt
Has modelling killed randomisation inference frankfurtHas modelling killed randomisation inference frankfurt
Has modelling killed randomisation inference frankfurt
Stephen Senn
 
Updated-Version_Domain-1_-Strand-1.3_1.4_Policy-Planning-and-Research-and-Inn...
Updated-Version_Domain-1_-Strand-1.3_1.4_Policy-Planning-and-Research-and-Inn...Updated-Version_Domain-1_-Strand-1.3_1.4_Policy-Planning-and-Research-and-Inn...
Updated-Version_Domain-1_-Strand-1.3_1.4_Policy-Planning-and-Research-and-Inn...
riamagdosa18
 
Hub AI&BigData meetup / Вадим Кузьменко: Как машинное обучение помогает снизи...
Hub AI&BigData meetup / Вадим Кузьменко: Как машинное обучение помогает снизи...Hub AI&BigData meetup / Вадим Кузьменко: Как машинное обучение помогает снизи...
Hub AI&BigData meetup / Вадим Кузьменко: Как машинное обучение помогает снизи...
Hub-IT-School
 
Introduction to Data Analytics with R
Introduction to Data Analytics with RIntroduction to Data Analytics with R
Introduction to Data Analytics with R
Wei Zhong Toh
 
Information Retrieval 08
Information Retrieval 08 Information Retrieval 08
Information Retrieval 08
Jeet Das
 
Introduction to cart_2009
Introduction to cart_2009Introduction to cart_2009
Introduction to cart_2009
Matthew Magistrado
 
Unit 5-1.pdf
Unit 5-1.pdfUnit 5-1.pdf
Unit 5-1.pdf
marow75067
 
Bottle sum
Bottle sumBottle sum
Bottle sum
MasatoUmakoshi
 
Clusters (4).pptx
Clusters (4).pptxClusters (4).pptx
Clusters (4).pptx
brahimNasibov
 
Decision Tree Machine Learning Detailed Explanation.
Decision Tree Machine Learning Detailed Explanation.Decision Tree Machine Learning Detailed Explanation.
Decision Tree Machine Learning Detailed Explanation.
DrezzingGaming
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
Suchismita Prusty
 
R User Group Singapore, Data Mining with R (Workshop II) - Random forests
R User Group Singapore, Data Mining with R (Workshop II) - Random forestsR User Group Singapore, Data Mining with R (Workshop II) - Random forests
R User Group Singapore, Data Mining with R (Workshop II) - Random forests
Wei Zhong Toh
 

Similar to Segmentation: Clustering and Classification (20)

CART Training 1999
CART Training 1999CART Training 1999
CART Training 1999
 
4. Formulating research problems
4. Formulating research problems4. Formulating research problems
4. Formulating research problems
 
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
 
Read first few slides cluster analysis
Read first few slides cluster analysisRead first few slides cluster analysis
Read first few slides cluster analysis
 
Clusteranalysis 121206234137-phpapp01
Clusteranalysis 121206234137-phpapp01Clusteranalysis 121206234137-phpapp01
Clusteranalysis 121206234137-phpapp01
 
Clusteranalysis
Clusteranalysis Clusteranalysis
Clusteranalysis
 
Engineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network AnalysisEngineering Data Science Objectives for Social Network Analysis
Engineering Data Science Objectives for Social Network Analysis
 
5. Non parametric analysis
5. Non parametric analysis5. Non parametric analysis
5. Non parametric analysis
 
Has modelling killed randomisation inference frankfurt
Has modelling killed randomisation inference frankfurtHas modelling killed randomisation inference frankfurt
Has modelling killed randomisation inference frankfurt
 
Updated-Version_Domain-1_-Strand-1.3_1.4_Policy-Planning-and-Research-and-Inn...
Updated-Version_Domain-1_-Strand-1.3_1.4_Policy-Planning-and-Research-and-Inn...Updated-Version_Domain-1_-Strand-1.3_1.4_Policy-Planning-and-Research-and-Inn...
Updated-Version_Domain-1_-Strand-1.3_1.4_Policy-Planning-and-Research-and-Inn...
 
Hub AI&BigData meetup / Вадим Кузьменко: Как машинное обучение помогает снизи...
Hub AI&BigData meetup / Вадим Кузьменко: Как машинное обучение помогает снизи...Hub AI&BigData meetup / Вадим Кузьменко: Как машинное обучение помогает снизи...
Hub AI&BigData meetup / Вадим Кузьменко: Как машинное обучение помогает снизи...
 
Introduction to Data Analytics with R
Introduction to Data Analytics with RIntroduction to Data Analytics with R
Introduction to Data Analytics with R
 
Information Retrieval 08
Information Retrieval 08 Information Retrieval 08
Information Retrieval 08
 
Introduction to cart_2009
Introduction to cart_2009Introduction to cart_2009
Introduction to cart_2009
 
Unit 5-1.pdf
Unit 5-1.pdfUnit 5-1.pdf
Unit 5-1.pdf
 
Bottle sum
Bottle sumBottle sum
Bottle sum
 
Clusters (4).pptx
Clusters (4).pptxClusters (4).pptx
Clusters (4).pptx
 
Decision Tree Machine Learning Detailed Explanation.
Decision Tree Machine Learning Detailed Explanation.Decision Tree Machine Learning Detailed Explanation.
Decision Tree Machine Learning Detailed Explanation.
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
R User Group Singapore, Data Mining with R (Workshop II) - Random forests
R User Group Singapore, Data Mining with R (Workshop II) - Random forestsR User Group Singapore, Data Mining with R (Workshop II) - Random forests
R User Group Singapore, Data Mining with R (Workshop II) - Random forests
 

More from Michael770443

Discrete Choice Model - Part 2
Discrete Choice Model - Part 2Discrete Choice Model - Part 2
Discrete Choice Model - Part 2
Michael770443
 
Discrete Choice Model
Discrete Choice ModelDiscrete Choice Model
Discrete Choice Model
Michael770443
 
Categorical Data and Statistical Analysis
Categorical Data and Statistical AnalysisCategorical Data and Statistical Analysis
Categorical Data and Statistical Analysis
Michael770443
 
Analysis of Variance
Analysis of VarianceAnalysis of Variance
Analysis of Variance
Michael770443
 
Regression Analysis
Regression AnalysisRegression Analysis
Regression Analysis
Michael770443
 
Linear Regression
Linear RegressionLinear Regression
Linear Regression
Michael770443
 
Introduction to Statistical Methods
Introduction to Statistical MethodsIntroduction to Statistical Methods
Introduction to Statistical Methods
Michael770443
 
Overview of Statistical Concepts
Overview of Statistical ConceptsOverview of Statistical Concepts
Overview of Statistical Concepts
Michael770443
 

More from Michael770443 (8)

Discrete Choice Model - Part 2
Discrete Choice Model - Part 2Discrete Choice Model - Part 2
Discrete Choice Model - Part 2
 
Discrete Choice Model
Discrete Choice ModelDiscrete Choice Model
Discrete Choice Model
 
Categorical Data and Statistical Analysis
Categorical Data and Statistical AnalysisCategorical Data and Statistical Analysis
Categorical Data and Statistical Analysis
 
Analysis of Variance
Analysis of VarianceAnalysis of Variance
Analysis of Variance
 
Regression Analysis
Regression AnalysisRegression Analysis
Regression Analysis
 
Linear Regression
Linear RegressionLinear Regression
Linear Regression
 
Introduction to Statistical Methods
Introduction to Statistical MethodsIntroduction to Statistical Methods
Introduction to Statistical Methods
 
Overview of Statistical Concepts
Overview of Statistical ConceptsOverview of Statistical Concepts
Overview of Statistical Concepts
 

Recently uploaded

Hindi varnamala | hindi alphabet PPT.pdf
Hindi varnamala | hindi alphabet PPT.pdfHindi varnamala | hindi alphabet PPT.pdf
Hindi varnamala | hindi alphabet PPT.pdf
Dr. Mulla Adam Ali
 
Constructing Your Course Container for Effective Communication
Constructing Your Course Container for Effective CommunicationConstructing Your Course Container for Effective Communication
Constructing Your Course Container for Effective Communication
Chevonnese Chevers Whyte, MBA, B.Sc.
 
The basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptxThe basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptx
heathfieldcps1
 
UGC NET Exam Paper 1- Unit 1:Teaching Aptitude
UGC NET Exam Paper 1- Unit 1:Teaching AptitudeUGC NET Exam Paper 1- Unit 1:Teaching Aptitude
UGC NET Exam Paper 1- Unit 1:Teaching Aptitude
S. Raj Kumar
 
How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17
Celine George
 
Liberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdfLiberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdf
WaniBasim
 
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
PECB
 
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
GeorgeMilliken2
 
How to deliver Powerpoint Presentations.pptx
How to deliver Powerpoint  Presentations.pptxHow to deliver Powerpoint  Presentations.pptx
How to deliver Powerpoint Presentations.pptx
HajraNaeem15
 
Temple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation resultsTemple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation results
Krassimira Luka
 
BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...
BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...
BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...
Nguyen Thanh Tu Collection
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
Nguyen Thanh Tu Collection
 
clinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdfclinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdf
Priyankaranawat4
 
Solutons Maths Escape Room Spatial .pptx
Solutons Maths Escape Room Spatial .pptxSolutons Maths Escape Room Spatial .pptx
Solutons Maths Escape Room Spatial .pptx
spdendr
 
NEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptx
NEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptxNEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptx
NEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptx
iammrhaywood
 
Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...
Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...
Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...
Leena Ghag-Sakpal
 
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
Nguyen Thanh Tu Collection
 
Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47
MysoreMuleSoftMeetup
 
Pengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptxPengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptx
Fajar Baskoro
 
How to Setup Warehouse & Location in Odoo 17 Inventory
How to Setup Warehouse & Location in Odoo 17 InventoryHow to Setup Warehouse & Location in Odoo 17 Inventory
How to Setup Warehouse & Location in Odoo 17 Inventory
Celine George
 

Recently uploaded (20)

Hindi varnamala | hindi alphabet PPT.pdf
Hindi varnamala | hindi alphabet PPT.pdfHindi varnamala | hindi alphabet PPT.pdf
Hindi varnamala | hindi alphabet PPT.pdf
 
Constructing Your Course Container for Effective Communication
Constructing Your Course Container for Effective CommunicationConstructing Your Course Container for Effective Communication
Constructing Your Course Container for Effective Communication
 
The basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptxThe basics of sentences session 6pptx.pptx
The basics of sentences session 6pptx.pptx
 
UGC NET Exam Paper 1- Unit 1:Teaching Aptitude
UGC NET Exam Paper 1- Unit 1:Teaching AptitudeUGC NET Exam Paper 1- Unit 1:Teaching Aptitude
UGC NET Exam Paper 1- Unit 1:Teaching Aptitude
 
How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17How to Make a Field Mandatory in Odoo 17
How to Make a Field Mandatory in Odoo 17
 
Liberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdfLiberal Approach to the Study of Indian Politics.pdf
Liberal Approach to the Study of Indian Politics.pdf
 
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
ISO/IEC 27001, ISO/IEC 42001, and GDPR: Best Practices for Implementation and...
 
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
 
How to deliver Powerpoint Presentations.pptx
How to deliver Powerpoint  Presentations.pptxHow to deliver Powerpoint  Presentations.pptx
How to deliver Powerpoint Presentations.pptx
 
Temple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation resultsTemple of Asclepius in Thrace. Excavation results
Temple of Asclepius in Thrace. Excavation results
 
BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...
BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...
BÀI TẬP BỔ TRỢ TIẾNG ANH LỚP 9 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2024-2025 - ...
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
BÀI TẬP BỔ TRỢ TIẾNG ANH 8 CẢ NĂM - GLOBAL SUCCESS - NĂM HỌC 2023-2024 (CÓ FI...
 
clinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdfclinical examination of hip joint (1).pdf
clinical examination of hip joint (1).pdf
 
Solutons Maths Escape Room Spatial .pptx
Solutons Maths Escape Room Spatial .pptxSolutons Maths Escape Room Spatial .pptx
Solutons Maths Escape Room Spatial .pptx
 
NEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptx
NEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptxNEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptx
NEWSPAPERS - QUESTION 1 - REVISION POWERPOINT.pptx
 
Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...
Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...
Bed Making ( Introduction, Purpose, Types, Articles, Scientific principles, N...
 
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
BÀI TẬP DẠY THÊM TIẾNG ANH LỚP 7 CẢ NĂM FRIENDS PLUS SÁCH CHÂN TRỜI SÁNG TẠO ...
 
Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47Mule event processing models | MuleSoft Mysore Meetup #47
Mule event processing models | MuleSoft Mysore Meetup #47
 
Pengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptxPengantar Penggunaan Flutter - Dart programming language1.pptx
Pengantar Penggunaan Flutter - Dart programming language1.pptx
 
How to Setup Warehouse & Location in Odoo 17 Inventory
How to Setup Warehouse & Location in Odoo 17 InventoryHow to Setup Warehouse & Location in Odoo 17 Inventory
How to Setup Warehouse & Location in Odoo 17 Inventory
 

Segmentation: Clustering and Classification

  • 1. U N I V E R S I T Y O F S O U T H F L O R I D A // 1 U N I V E R S I T Y O F S O U T H F L O R I D A // U N I V E R S I T Y O F S O U T H F L O R I D A // Segmentation: Clustering and Classification Dr. S. Shivendu
  • 2. U N I V E R S I T Y O F S O U T H F L O R I D A // 2 U N I V E R S I T Y O F S O U T H F L O R I D A // 2 Objectives Overview of Segmentation Perform clustering and classification. 01 Perform structural equation modeling. 02 Perform programming in SAS to do factor analysis 03
  • 3. U N I V E R S I T Y O F S O U T H F L O R I D A // 3 U N I V E R S I T Y O F S O U T H F L O R I D A // 3 Agenda Overview of Statistical Concepts Segmentation: Clustering and Classification Structural Equation Modeling Factor Analysis in SAS
  • 4. U N I V E R S I T Y O F S O U T H F L O R I D A // 4 U N I V E R S I T Y O F S O U T H F L O R I D A // 4 Course Textbooks
  • 5. U N I V E R S I T Y O F S O U T H F L O R I D A // 5 Segmentation • Classifying or dividing set of objects or consumers or people into different groups • Market segmentation refers to dividing consumers into different distinct groups based on some characteristics • Statistical classification refers to using statistical tools to divide the population in sub groups • Machine learning methods are also used for dividing the population into subgroups • We will cover two approaches today • Clustering • Machine learning method – Decision Trees – for classification (very briefly)
  • 6. U N I V E R S I T Y O F S O U T H F L O R I D A // 6 Unsupervised learning: Clustering • The organization of unlabeled data into similarity groups is called clusters. • Therefore, a cluster is a collection of data items which are “similar” between them, and “dissimilar” to data items in other clusters.
  • 7. U N I V E R S I T Y O F S O U T H F L O R I D A // 7 Initial applications of clustering
  • 8. U N I V E R S I T Y O F S O U T H F L O R I D A // 8 More recent example: computer vision
  • 9. U N I V E R S I T Y O F S O U T H F L O R I D A // 9 Clustering: Basic idea Group together similar instances: BUT a or b or c? a. b. c.
  • 10. U N I V E R S I T Y O F S O U T H F L O R I D A // 10 Requirements for clustering
  • 11. U N I V E R S I T Y O F S O U T H F L O R I D A // 11 Measure of distance or ‘dissimilarity’
  • 12. U N I V E R S I T Y O F S O U T H F L O R I D A // 12 Cluster evaluation • Intra-cluster cohesion (compactness): • Cohesion measures how near the data points in a cluster are to the cluster centroid. • Sum of squared error (SSE) is a commonly used measure. • Inter-cluster separation (isolation): • Separation means that different cluster centroids should be far away from one another. • In most applications, the key is still expert judgment
  • 13. U N I V E R S I T Y O F S O U T H F L O R I D A // 13 How many clusters?
  • 14. U N I V E R S I T Y O F S O U T H F L O R I D A // 14 Clustering methods
  • 15. U N I V E R S I T Y O F S O U T H F L O R I D A // 15 Clustering methods (contd.)
  • 16. U N I V E R S I T Y O F S O U T H F L O R I D A // 16 Clustering: k-means
  • 17. U N I V E R S I T Y O F S O U T H F L O R I D A // 17 K-means clustering
  • 18. U N I V E R S I T Y O F S O U T H F L O R I D A // 18 K-means algorithm
  • 19. U N I V E R S I T Y O F S O U T H F L O R I D A // 19 K-means convergence criterion
  • 20. U N I V E R S I T Y O F S O U T H F L O R I D A // 20 K-means clustering: Step 1
  • 21. U N I V E R S I T Y O F S O U T H F L O R I D A // 21 K-means clustering: Step 2
  • 22. U N I V E R S I T Y O F S O U T H F L O R I D A // 22 k-means clustering example
  • 23. U N I V E R S I T Y O F S O U T H F L O R I D A // 23 k-means clustering example
  • 24. U N I V E R S I T Y O F S O U T H F L O R I D A // 24 k-means clustering example
  • 25. U N I V E R S I T Y O F S O U T H F L O R I D A // 25 K-means: Strength • Simple: easy to understand and to implement • Efficient: Time complexity: O(tkn), • where n is the number of data points, • k is the number of clusters, and • t is the number of iterations. • Since both k and t are small, k-means is considered a linear algorithm. • K-means is the most popular clustering algorithm. • Note that: it terminates at a local optimum if SSE is used. The global optimum is hard to find due to complexity.
  • 26. U N I V E R S I T Y O F S O U T H F L O R I D A // 26 K-means: Weaknesses • The algorithm is only applicable if the mean is defined. • For categorical data, k-mode - the centroid is represented by most frequent values. • The user needs to specify k • The algorithm is sensitive to outliers • Outliers are data points that are very far away from other data points. • Outliers could be errors in the data recording or some special data points with very different values
  • 27. U N I V E R S I T Y O F S O U T H F L O R I D A // 27 Outliers
  • 28. U N I V E R S I T Y O F S O U T H F L O R I D A // 28 Dealing with outliers • Remove some data points that are much further away from the centroids than other data points • To be safe, we may want to monitor these possible outliers over a few iterations and then decide to remove them. • Perform random sampling: by choosing a small subset of the data points, the chance of selecting an outlier is much smaller • Assign the rest of the data points to the clusters by distance or similarity comparison, or classification
  • 29. U N I V E R S I T Y O F S O U T H F L O R I D A // 29 Special data structures: Limitation • The k-means algorithm is not suitable for discovering clusters that are not hyper- ellipsoids (or hyper-spheres)
  • 30. U N I V E R S I T Y O F S O U T H F L O R I D A // 30 Example: k-means for segmentation
  • 31. U N I V E R S I T Y O F S O U T H F L O R I D A // 31 Example: k-means for segmentation
  • 32. U N I V E R S I T Y O F S O U T H F L O R I D A // 32 Example: k-means for segmentation
  • 33. U N I V E R S I T Y O F S O U T H F L O R I D A // 33 K-means: Summary • Despite weaknesses, k-means is still the most popular algorithm due to its simplicity and efficiency • No clear evidence that any other clustering algorithm performs better in general • Comparing different clustering algorithms is a difficult task. No one knows the correct clusters!
  • 34. U N I V E R S I T Y O F S O U T H F L O R I D A // 34 Clustering method: Hierarchical
  • 35. U N I V E R S I T Y O F S O U T H F L O R I D A // 35 Hierarchical clustering
  • 36. U N I V E R S I T Y O F S O U T H F L O R I D A // 36 Example: Biological taxonomy
  • 37. U N I V E R S I T Y O F S O U T H F L O R I D A // 37 Types of hierarchical clustering • Top-down clustering (or divisive clustering) • Starts with all data points in one cluster, the root, then • splits the root into a set of child clusters. Each child cluster is recursively divided further • stops when only singleton clusters of individual data points remain, i.e., each cluster with only a single point • Bottom-up clustering (or agglomerative clustering) • The dendrogram (?) is built from the bottom level by • merging the most similar (or nearest) pair of clusters • stopping when all the data points are merged into a single cluster (i.e., the root cluster).
  • 38. U N I V E R S I T Y O F S O U T H F L O R I D A // 38 What is a Dendrogram?
  • 39. U N I V E R S I T Y O F S O U T H F L O R I D A // 39 Divisive hierarchical clustering
  • 40. U N I V E R S I T Y O F S O U T H F L O R I D A // 40 Agglomerative hierarchical clustering
  • 41. U N I V E R S I T Y O F S O U T H F L O R I D A // 41 Single linkage or Nearest neighbor
  • 42. U N I V E R S I T Y O F S O U T H F L O R I D A // 42 Complete linkage or Farthest neighbor
  • 43. U N I V E R S I T Y O F S O U T H F L O R I D A // 43 Divisive vs. Agglomerative
  • 44. U N I V E R S I T Y O F S O U T H F L O R I D A // 44 In summary • Clustering has a long history and still is in active research • There are a huge number of clustering algorithms, among them: Density based algorithm, Sub-space clustering, Scale-up methods, Neural networks based methods, Fuzzy clustering, Co-clustering • Clustering is hard to evaluate, but very useful in practice • Clustering is highly application dependent (and to some extent subjective)
  • 45. U N I V E R S I T Y O F S O U T H F L O R I D A // 45 Machine learning and classification • Machine learning: studies how to automatically learn to make accurate predictions based on past observations • Classification problem: classify observations into given set of categories
  • 46. U N I V E R S I T Y O F S O U T H F L O R I D A // 46 Examples of classification problem • Text categorization (e.g., spam filtering) • Fraud detection • Optical character detection • Machine vision (face detection) • Natural-language processing(e.g., spoken language understanding) • Market segmentation(e.g.: predict if customer will respond to promotion) • Bioinformatics(e.g., classify proteins according to their function)
  • 47. U N I V E R S I T Y O F S O U T H F L O R I D A // 47 Classification algorithms in machine learning • Decision trees • Boosting • Support vector machine • Neural networks • Nearest neighbor algorithm • Naive bays • Bagging • Random forest • We will discuss only decision trees 5/23/2022 47
  • 48. U N I V E R S I T Y O F S O U T H F L O R I D A // 48 Decision trees: An example (For Reference Only: Slides 48 to 62)
  • 49. U N I V E R S I T Y O F S O U T H F L O R I D A // 49 A decision tree classifier
  • 50. U N I V E R S I T Y O F S O U T H F L O R I D A // 50 How to built decision trees • Choose rule to split on • Divide data following the • splitting rule • in disjoint sets
  • 51. U N I V E R S I T Y O F S O U T H F L O R I D A // 51 Building decision trees
  • 52. U N I V E R S I T Y O F S O U T H F L O R I D A // 52 How to choose splitting rule?
  • 53. U N I V E R S I T Y O F S O U T H F L O R I D A // 53 How to measure purity?
  • 54. U N I V E R S I T Y O F S O U T H F L O R I D A // 54 Different types of error rates • Training error = fraction of training examples misclassified • Test error = fraction of test examples misclassified • Generalization error = probability of misclassifying new random example
  • 55. U N I V E R S I T Y O F S O U T H F L O R I D A // 55 A possible classifier
  • 56. U N I V E R S I T Y O F S O U T H F L O R I D A // 56 Another classifier
  • 57. U N I V E R S I T Y O F S O U T H F L O R I D A // 57 Test size vs accuracy
  • 58. U N I V E R S I T Y O F S O U T H F L O R I D A // 58 Overfitting: an example
  • 59. U N I V E R S I T Y O F S O U T H F L O R I D A // 59 Building a good classifier
  • 60. U N I V E R S I T Y O F S O U T H F L O R I D A // 60 Example
  • 61. U N I V E R S I T Y O F S O U T H F L O R I D A // 61 Good and bad classifiers
  • 62. U N I V E R S I T Y O F S O U T H F L O R I D A // 62 Controlling tree size
  • 63. U N I V E R S I T Y O F S O U T H F L O R I D A // 63 Good and bad of decision trees • Very fast to train and evaluate • Relatively easy to interpret • BUT: Accuracy often not state-of-the-art
  • 64. U N I V E R S I T Y O F S O U T H F L O R I D A // 64 Practical considerations • There are many classification methods in machine learning • Each of them has advantages and disadvantages and are good for specific domain • Practical considerations • Want training data to be like test data • Use your knowledge of problem to know where to get training data, and what to expect test data to be like
  • 65. U N I V E R S I T Y O F S O U T H F L O R I D A // 65 Structural Equation Modeling • Structural equation modeling (SEM) is • a family of statistical methods that allow one to model complex relationships between one or more independent variables and one or more dependent variables. • Though there are many ways to describe SEM, it is most commonly thought of • as a hybrid between some form of analysis of variance (ANOVA)/regression and some form of factor analysis. • In general, it can be remarked that SEM allows one to • perform some type of multilevel regression/ANOVA on factors. • One should therefore be quite familiar with • univariate and multivariate regression/ANOVA • as well as the basics of factor analysis • to implement SEM for one's data.
  • 66. U N I V E R S I T Y O F S O U T H F L O R I D A // 66 Some preliminary terminology • Variables that are not influenced by any other variables in a model are called exogenous variables. • As an example, suppose we have two factors that cause changes in GPA: hours studying per week and IQ. Suppose there is no causal relationship between hours studying and IQ. Then both IQ and hours studying would be exogenous variables in the model. • Variables that are influenced by other variables in a model are called endogenous variables. GPA would be an endogenous variable in the previous example in (a). • A variable that is directly observed and measured is called a manifest variable (it is also called an indicator variable in some circles). • In the example in (a), all variables can be directly observed and thus qualify as manifest variables. A structural equation model that examines only manifest variables has a special name: path analysis.
  • 67. U N I V E R S I T Y O F S O U T H F L O R I D A // 67 Some preliminary terminology (contd.) • A variable that is not directly measured is a latent variable. • The “factors” in a factor analysis are latent variables. • For example, suppose we were additionally interested in the impact of motivation on GPA. Motivation, as it is an internal, non-observable state, is indirectly assessed by a student’s response on a questionnaire, and thus it is a latent variable. • Latent variables increase the complexity of a structural equation model because one needs to take into account all of the questionnaire items and measured responses that are used to quantify the “factor” or latent variable. • In this instance, each item on the questionnaire would be a single variable that would either be significantly or insignificantly involved in a linear combination of variables that influence the variation in the latent factor of motivation
  • 68. U N I V E R S I T Y O F S O U T H F L O R I D A // 68 Some preliminary terminology • For the purposes of SEM, specifically, moderation refers to a situation that includes three or more variables, such that the presence of one of those variables changes the relationship between the other two. • In other words, moderation exists when the association between two variables is not the same at all levels of a third variable. • One way to think of moderation is when you observe an interaction between two variables in an ANOVA. • For example, the relationship between stress and psychological adjustment may differ at different levels of social support (i.e., this is the definition of an interaction). • In other words, stress may adversely affect adjustment more under conditions of low social support compared to conditions of high social support. This would imply a two-way interaction between stress and social support if an ANOVA were to be performed. • Figure 1 shows a conceptual diagram of moderation. This diagram shows that there are three direct effects that are hypothesized to cause changes in psychological adjustment – a main effect of stress, a main effect of social support, and an interaction effect of stress and social support.
  • 69. U N I V E R S I T Y O F S O U T H F L O R I D A // 69 What is moderation? Figure 1. A Diagrammed Example of Moderation. Note that each individual effect of stress, social support, and the interaction of stress and social support can be separated and is said to be related to psychological adjustment. Also note that there are no causal connections between stress and social support. [Diagram: Stress, Social Support, and the Stress X Social Support interaction each point directly to Psychological Adjustment.]
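Since moderation corresponds to an interaction term in a regression/ANOVA, here is a minimal SAS sketch of testing it; the dataset STUDY and the variables STRESS, SUPPORT, and ADJUST are hypothetical placeholders.

   /* Hedged sketch: testing moderation as an interaction term in SAS.           */
   /* STUDY, STRESS, SUPPORT, and ADJUST are hypothetical names.                 */
   proc glm data=study;
      model adjust = stress support stress*support;   /* main effects plus the interaction */
   run;

A significant STRESS*SUPPORT term would indicate that the association between stress and adjustment differs across levels of social support, i.e., moderation.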
  • 70. U N I V E R S I T Y O F S O U T H F L O R I D A // 70 Some more terminology: Mediation • In SEM, mediation refers to a situation • that includes three or more variables, such that there is a causal process between all three variables. • Note that this is distinct from moderation. • In the previous example in (e), we can say there are three separate things in the model that cause a change in psychological adjustment: stress, social support, and the combined effect of stress and social support that is not accounted for by each individual variable. • Mediation describes a much different relationship that is generally more complex. • In a mediation relationship, there is a direct effect between an independent variable and a dependent variable. • There are also indirect effects between an independent variable and a mediator variable, and between a mediator variable and a dependent variable.
  • 71. U N I V E R S I T Y O F S O U T H F L O R I D A // 71 Mediation (contd.) • The example in Figure 1 can be re-specified as a mediating process, as shown in Figure 2. • Figure 2: A Diagram of a Mediation Model. Including indirect effects in a mediation model may change the direct effect of a single independent variable on a dependent variable. [Diagram: a (revised) direct effect from Stress to Psychological Adjustment, plus indirect effects from Stress to Social Support and from Social Support to Psychological Adjustment.]
  • 72. U N I V E R S I T Y O F S O U T H F L O R I D A // 72 Mediation effect • The main difference from the moderation model is that we now allow causal relationships between stress and social support, and between social support and psychological adjustment, to be expressed. • Imagine that social support was not included in the model – we just wanted to see the direct effect of stress on psychological adjustment. • We would get a measure of the direct effect by using regression or ANOVA. When we include social support as a mediator, that direct effect will change as a result of decomposing the causal process into indirect effects of stress on social support and of social support on psychological adjustment. • The degree to which the direct effect changes as a result of including the mediating variable of social support is referred to as the mediational effect. • Testing for mediation involves running a series of regression analyses for all of the causal pathways and some method of estimating the change in the direct effect. This technique is built into structural equation models that include mediator variables and will be discussed in the next section of this document.
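As a hedged illustration of that series of regressions, the SAS sketch below fits the three models commonly used to examine mediation; STUDY, STRESS, SUPPORT, and ADJUST are the same hypothetical names used above.

   /* Hedged sketch: the series of regressions used to examine mediation.      */
   /* STUDY, STRESS, SUPPORT, and ADJUST are hypothetical names.                */
   proc reg data=study;
      model adjust  = stress;           /* total (unmediated) effect of stress  */
      model support = stress;           /* path from stress to the mediator     */
      model adjust  = stress support;   /* direct effect of stress, controlling for the mediator */
   run;

The drop in the stress coefficient from the first model to the third is an informal gauge of the mediational effect; the structural equation model itself provides a more formal decomposition into direct and indirect effects.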
  • 73. U N I V E R S I T Y O F S O U T H F L O R I D A // 73 Moderation and Mediation • In many respects, moderation and mediation models are the foundation of structural equation modeling. • In fact, they can be considered simple structural equation models themselves. • Therefore, it is very important to understand how to analyze such models before moving on to more complex structural equation models that include latent variables.
  • 74. U N I V E R S I T Y O F S O U T H F L O R I D A // 74 Covariance and Correlation Covariance and correlation are the building blocks of how the data will be represented in the model specification for structural equation modeling. One should know how to obtain a correlation or covariance matrix using PROC CORR in SAS. In practice, the covariance matrix serves as the dataset to be analyzed. In the context of SEM, covariances and correlations between variables are essential because they allow one to include a relationship between two variables that is not necessarily causal. In practice, most structural equation models contain both causal and non-causal relationships. Obtaining covariance estimates between variables allows one to better estimate direct and indirect effects, particularly in complex models with many parameters to be estimated.
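A minimal sketch of producing a covariance matrix with PROC CORR is shown below; the dataset STUDY and its variables are hypothetical placeholders.

   /* Hedged sketch: obtaining a covariance matrix with PROC CORR.              */
   /* STUDY, STRESS, SUPPORT, and ADJUST are hypothetical names.                */
   proc corr data=study cov outp=covmat;   /* COV prints covariances; OUTP= also */
                                           /* writes them to an output dataset   */
      var stress support adjust;
   run;

   proc print data=covmat;   /* inspect the output dataset of covariances */
   run;

Because the COV option is specified, the OUTP= dataset is created as a TYPE=COV dataset, a form that PROC CALIS can read directly.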
  • 75. U N I V E R S I T Y O F S O U T H F L O R I D A // 75 Structural model • A structural model is a part of the entire structural equation model diagram that one will complete for every model one proposes. • It is used to relate all of the variables (both latent and manifest) one will need to account for in the model. • There are a few important rules to follow when creating a structural model.
  • 76. U N I V E R S I T Y O F S O U T H F L O R I D A // 76 Measurement model • A measurement model is a part of the entire structural equation model diagram that one will complete for every proposed model. • It is essential if there are latent variables in the model. • This is the part of the diagram that is analogous to factor analysis: • One needs to include all individual items, variables, or observations that “load” onto the latent variable, along with their relationships, variances, and errors. • There are a few important rules to follow when creating the measurement model that we will discuss later in this document.
  • 77. U N I V E R S I T Y O F S O U T H F L O R I D A // 77 Finally • Together, the structural model and the measurement model form the entire structural equation model. • This model includes everything that has been measured, observed, or otherwise manipulated in the set of variables examined. • A recursive structural equation model is a model in which causation is directed in one single direction. • A nonrecursive structural equation model has causation that flows in both directions in some parts of the model.
  • 78. U N I V E R S I T Y O F S O U T H F L O R I D A // 78 Research questions suitable for SEM • SEM can conceptually be used to answer any research question involving the indirect or direct observation of one or more independent variables or one or more dependent variables. • However, the primary goal of SEM is to determine and validate a proposed causal process and/or model. • Therefore, SEM is a confirmatory technique. • Like any other test or model, we have a sample and want to say something about the population from which the sample was drawn. We have a covariance matrix to serve as our dataset, which is based on the sample of collected measurements. • The empirical question of SEM is therefore whether the proposed model produces a population covariance matrix that is consistent with the sample covariance matrix. • Because one must specify a priori a model that will undergo validation testing, there are many questions SEM can answer.
  • 79. U N I V E R S I T Y O F S O U T H F L O R I D A // 79 Limitations and Assumptions of SEM • Because SEM is a confirmatory technique, one must plan accordingly. • One must specify a full model a priori and test that model based on the sample and variables included in the measurements. • One must know the number of parameters one needs to estimate – including covariances, path coefficients, and variances. • One must know all relationships one wants to specify in the model. Then, and only then, can one begin analyses. • Because SEM has the ability to model complex relationships between multivariate data, sample size is an important (but unfortunately underemphasized) issue. • Two popular rules of thumb are that one needs more than 200 observations, or at least 50 more than 8 times the number of variables in the model. • A larger sample size is always desired for SEM.
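As a worked illustration of the second rule of thumb (with hypothetical numbers): a model with 12 observed variables would call for at least 50 + 8 × 12 = 146 observations, so the 200-observation guideline would be the binding one; with 25 variables the rule gives 50 + 8 × 25 = 250, which then exceeds 200.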
  • 80. U N I V E R S I T Y O F S O U T H F L O R I D A // 80 Limitations and assumptions (contd.) • Like other multivariate statistical methodologies, most of the estimation techniques used in SEM require multivariate normality. • Data need to be examined for univariate and multivariate outliers. Transformations on the variables can be made. However, there are some estimation methods that do not require normality. • SEM techniques only look at first-order (linear) relationships between variables. • Linear relationships can be explored by creating bivariate scatterplots for all of your variables. Power transformations can be made if a relationship between two variables seems quadratic.
  • 81. U N I V E R S I T Y O F S O U T H F L O R I D A // 81 Limitations and assumptions (contd.) • Multicollinearity among the IVs for manifest variables can be an issue. • Most programs will inspect the determinant of a section of the covariance matrix, or of the whole covariance matrix. • A very small determinant may be indicative of extreme multicollinearity. • The residuals of the covariances (not residual scores) need to be small and centered about zero. • Some goodness-of-fit tests (like the Lagrange multiplier test) remain robust against large or non-normal residuals.
  • 82. U N I V E R S I T Y O F S O U T H F L O R I D A // 82 Preparing data for analysis • Assuming one has checked the model assumptions, dealt with missing data, and imported the data into SAS, one should obtain a covariance matrix for the dataset. • In SAS, one can run PROC CORR and create an output file that contains the covariance matrix. • Alternatively, one can manually type in the covariance matrix or import it from a spreadsheet. • One must specify in the data declaration that the dataset is a covariance matrix, so, in SAS for example, the code would begin with data name(type=COV); • Effort must be made to ensure the decimals of the entries are aligned.
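A minimal sketch of typing in a covariance matrix by hand is shown below, assuming the standard layout of a TYPE=COV dataset (with _TYPE_ and _NAME_ variables); the variable names and the numeric values are hypothetical.

   /* Hedged sketch: entering a covariance matrix by hand as a TYPE=COV dataset. */
   /* The variable names and the numeric values are hypothetical.                */
   data covmat(type=cov);
      input _type_ $ _name_ $ stress support adjust;
      datalines;
   COV  stress    1.20  -0.35  -0.40
   COV  support  -0.35   0.90   0.25
   COV  adjust   -0.40   0.25   1.10
   N    .        200    200    200
   ;
   run;

Procedures that accept TYPE=COV data, such as PROC CALIS, can then read this dataset directly; the OUTP= dataset produced by PROC CORR earlier serves the same purpose without any manual typing.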
  • 83. U N I V E R S I T Y O F S O U T H F L O R I D A // 83 Developing model hypotheses and specifications • Often, the most difficult part of SEM is correctly specifying the model. • One must know exactly the number of latent and manifest variables one is including in the model, the number of variances and covariances to be calculated, and the number of parameters one is estimating.
  • 84. U N I V E R S I T Y O F S O U T H F L O R I D A // 84 Building model • The most important part of an SEM analysis is the causal model. The following basic, general rules are used: • Rule 1. Latent variables/factors are represented with circles and measured/manifest variables are represented with squares. • Rule 2. Lines with an arrow in one direction show a hypothesized direct relationship between the two variables. The arrow should originate at the causal variable and point to the variable that is caused. Absence of a line indicates there is no causal relationship between the variables.
  • 85. U N I V E R S I T Y O F S O U T H F L O R I D A // 85 Building model (contd.) • Rule 3. Lines with an arrow in both directions should be curved, and this demonstrates a bi-directional relationship (i.e., a covariance). • Rule 3a. Covariance arrows should only be allowed for exogenous variables.
  • 86. U N I V E R S I T Y O F S O U T H F L O R I D A // 86 Building model (contd.) • Rule 4. For every endogenous variable, a residual term should be added in the model. Generally, a residual term is a circle with the letter E written in it, which stands for error. • Rule 4a. For latent variables that are also endogenous, a residual term is not called error in the lingo of SEM. It is called a disturbance, and therefore the “error term” here would be a circle with a D written in it, standing for disturbance.
  • 87. U N I V E R S I T Y O F S O U T H F L O R I D A // 87 SEM • SEM is a useful method for building causal relationship models • It is the natural next step after factor analysis (both EFA and CFA) • The reading posted on Canvas is very useful for practicing and understanding the details of PROC CALIS
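As a hedged illustration (not the course's official example), the mediation model from Figure 2 could be specified in PROC CALIS roughly as sketched below; COVMAT is the hypothetical TYPE=COV dataset built earlier, and NOBS= supplies its assumed sample size.

   /* Hedged sketch: the Figure 2 mediation model in PROC CALIS PATH syntax.   */
   /* COVMAT is the hypothetical TYPE=COV dataset shown above; N = 200.        */
   proc calis data=covmat nobs=200;
      path
         stress  ---> support,   /* indirect path: stress to the mediator      */
         support ---> adjust,    /* indirect path: mediator to the outcome     */
         stress  ---> adjust;    /* (revised) direct effect                    */
   run;

PROC CALIS then estimates the path coefficients and reports fit statistics; the posted reading describes how to decompose total effects into direct and indirect components.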
  • 88. U N I V E R S I T Y O F S O U T H F L O R I D A // 88 You have reached the end of the presentation.