Cluster Analysis
Kohonen Networks
Kohonen Networks – Self Organizing Map
• Kohonen networks were introduced in 1982 by the Finnish researcher Teuvo Kohonen
• Although applied initially to image and sound analyses, Kohonen networks are
nevertheless an effective mechanism for clustering analysis
• Kohonen networks represent a type of self-organizing map (SOM), which itself
represents a special class of neural networks
• The goal of SOMs is to convert a complex high-dimensional input signal into a
simpler low-dimensional discrete map, which makes them well suited to cluster
analysis, where underlying hidden patterns among records and fields are sought
• SOMs structure the output nodes into clusters of nodes, where nodes in closer
proximity are more similar to each other than to other nodes that are farther apart
• Some researchers have shown that SOMs represent a nonlinear generalization of
principal components analysis, another dimension-reduction technique
• SOMs are based on competitive learning, where the output nodes compete among
themselves to be the winning node (or neuron), the only node to be activated by a
particular input observation
Clustering Task
• A typical SOM architecture is shown in the following figure
• The input layer is shown at the bottom of the figure, with one input node for each
field. Just as with neural networks, these input nodes perform no processing
themselves, but simply pass the field input values along downstream
Prerequisites of Clustering
• Like neural networks, SOMs are feedforward and completely connected
• Feedforward networks do not allow looping or cycling.
• Completely connected means that every node in a given layer is connected to every
node in the next layer, although not to other nodes in the same layer
• Like neural networks, each connection between nodes has a weight associated with
it, which at initialization is assigned randomly to a value between 0 and 1.
Adjusting these weights is the key to the learning mechanism in both
neural networks and SOMs
• Variable values need to be normalized or standardized, just as for neural networks,
so that certain variables do not overwhelm others in the learning algorithm
(a normalization sketch follows this list)
• Unlike most neural networks, SOMs have no hidden layer
• The data from the input layer is passed along directly to the output layer
• The output layer is represented in the form of a lattice, usually in one or two
dimensions, and typically in the shape of a rectangle, although other shapes, such as
hexagons, may be used
• The output layer shown in the earlier figure is a 3 × 3 square
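As a concrete illustration of the normalization prerequisite, here is a minimal min–max sketch that scales every field to [0, 1]; the raw age and income values are made up for the example:

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column of X to [0, 1] so that no single variable
    dominates the distance computations during learning."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # guard against constant columns
    return (X - col_min) / col_range

# Hypothetical raw (age, income) records
raw = np.array([[62, 95000], [58, 12000], [25, 110000], [22, 15000]])
print(min_max_normalize(raw))
```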
Working of SOM
• For a given record (instance), a particular field value is forwarded from a particular
input node to every node in the output layer
• For example, suppose that the normalized age and income values for the first record
in the data set are 0.69 and 0.88, respectively.
• The 0.69 value would enter the SOM through the input node associated with age,
and this node would pass this value of 0.69 to every node in the output layer
• Similarly, the 0.88 value would be distributed through the income input node to
every node in the output layer
• These values, together with the weights assigned to each of the connections, would
determine the values of a scoring function (such as Euclidean distance) for each
output node
• The output node with the “best” outcome from the scoring function would then
be designated as the winning node
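This forward pass and competition can be sketched in a few lines; the 3 × 3 output layer and the random initial weights below are illustrative, with the input (0.69, 0.88) taken from the text:

```python
import numpy as np

def competition(x, weights):
    """Score input x against every output node's weight vector using
    Euclidean distance; the node with the smallest distance wins."""
    distances = np.linalg.norm(weights - x, axis=1)
    winner = int(np.argmin(distances))
    return winner, distances

rng = np.random.default_rng(0)
weights = rng.random((9, 2))        # 3 x 3 lattice = 9 nodes, 2 input fields
x = np.array([0.69, 0.88])          # normalized age and income for record 1
winner, distances = competition(x, weights)
print("winning node:", winner, "distance:", round(distances[winner], 3))
```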
Working of SOM
• SOMs exhibit three characteristic processes:
1. Competition
• As mentioned above, the output nodes compete with each other to produce the best
value for a particular scoring function, most commonly the Euclidean distance
• In this case, the output node that has the smallest Euclidean distance between the
field inputs and the connection weights would be declared the winner
2. Cooperation.
• The winning node therefore becomes the center of a neighborhood of excited neurons.
This emulates the behavior of human neurons, which are sensitive to the output of other
neurons in their immediate neighborhood.
• In SOMs, all the nodes in this neighborhood share in the “excitement” or “reward”
earned by the winning node, that of adaptation. Thus, even though the nodes in the
output layer are not connected directly, they tend to share common features, due to
this neighborliness parameter (see the neighborhood sketch after this list)
3. Adaptation.
• The nodes in the neighborhood of the winning node participate in adaptation, that is,
learning. The weights of these nodes are adjusted so as to further improve the score
function
• In other words, these nodes will thereby have an increased chance of winning the
competition once again, for a similar set of field values
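A minimal sketch of the cooperation step follows. A square (Chebyshev-distance) neighborhood on the output lattice is assumed here; the exact lattice distance varies between implementations:

```python
def neighborhood(winner_rc, grid_shape, R):
    """Return grid coordinates of every output node whose Chebyshev
    distance on the lattice from the winning node is at most R."""
    rows, cols = grid_shape
    wr, wc = winner_rc
    return [(r, c)
            for r in range(rows)
            for c in range(cols)
            if max(abs(r - wr), abs(c - wc)) <= R]

# Winner in the corner of a 3 x 3 lattice, neighborhood size R = 1:
print(neighborhood((0, 0), (3, 3), R=1))   # [(0,0), (0,1), (1,0), (1,1)]
```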
Kohonen Networks
• Kohonen networks are SOMs that exhibit Kohonen learning
• Suppose that we consider the set of m field values for the nth record to be an input
vector xn = (xn1, xn2, … , xnm), and the current set of m weights for a particular
output node j to be a weight vector wj = (w1j, w2j, … , wmj)
• In Kohonen learning, the nodes in the neighborhood of the winning node adjust
their weights using a linear combination of the input vector and the current weight
vector: wij,new = wij,current + 𝜂(xni − wij,current), that is,
wij,new = (1 − 𝜂)·wij,current + 𝜂·xni, where 𝜂 (0 < 𝜂 ≤ 1) is the learning rate
• Kohonen indicates that the learning rate 𝜂 should be a decreasing function of
training epochs (runs through the data set) and that a linearly or geometrically
decreasing 𝜂 is satisfactory for most purposes
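The update rule is short enough to verify directly; this sketch uses the node-1 weights and first input record from the worked example that follows:

```python
import numpy as np

def kohonen_update(w, x, eta):
    """Kohonen learning rule for one node: nudge the weight vector w a
    fraction eta of the way toward the input x. Equivalent to the linear
    combination (1 - eta) * w + eta * x."""
    return w + eta * (x - w)

w = np.array([0.9, 0.8])   # current weights for the winning node
x = np.array([0.8, 0.8])   # input vector
print(kohonen_update(w, x, eta=0.5))   # -> [0.85, 0.8]
```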
Algorithm for Kohonen Networks
• The algorithm for Kohonen networks is shown below:
For each input vector x, perform the following steps:
• Competition: For each output node j, calculate the value D(wj, xn) of the
scoring function. For example, for Euclidean distance,
D(wj, xn) = √Σi (wij − xni)². Find the winning node J that minimizes
D(wj, xn) over all output nodes
• Cooperation: Identify all output nodes j within the topological neighborhood of
J defined by the neighborhood size R.
• Adaptation: Adjust the weights of all nodes in the neighborhood:
wij,new = wij,current + 𝜂(xni − wij,current).
Adjust the learning rate and neighborhood size, as needed
• Stop when the termination criteria are met
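Putting the three steps together, a minimal NumPy implementation of the algorithm might look as follows. The grid shape, linear learning-rate decay, fixed R, and termination after a fixed number of epochs are assumptions, not prescriptions from the text:

```python
import numpy as np

def train_som(X, grid_shape=(3, 3), eta=0.3, R=1, epochs=100, seed=0):
    """Sketch of a Kohonen SOM: competition, cooperation, adaptation.
    X is an (n_records, m) array of normalized field values."""
    rows, cols = grid_shape
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    W = rng.random((rows, cols, m))              # random weights in [0, 1]

    for epoch in range(epochs):
        eta_t = eta * (1 - epoch / epochs)       # linearly decaying learning rate
        for x in X:
            # Competition: node with the smallest Euclidean distance wins
            d = np.linalg.norm(W - x, axis=2)
            wr, wc = np.unravel_index(np.argmin(d), d.shape)
            # Cooperation + adaptation: winner and its neighbors move toward x
            for r in range(rows):
                for c in range(cols):
                    if max(abs(r - wr), abs(c - wc)) <= R:
                        W[r, c] += eta_t * (x - W[r, c])
    return W

def assign_clusters(X, W):
    """Score each record against the trained map; return its winning node."""
    flat = W.reshape(-1, W.shape[2])
    return [int(np.argmin(np.linalg.norm(flat - x, axis=1))) for x in X]

# Usage: W = train_som(X) for a normalized array X, then
# assign_clusters(X, W) labels each record with its winning output node.
```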
A Worked Example
• Suppose that we have a data set with two attributes, age and income, which have
already been normalized, and suppose that we would like to use a 2 × 2 Kohonen
network to uncover hidden clusters in the data set. The Network is shown below:
A Worked Example
• With such a small network, we set the neighborhood size to be R = 0, so that only
the winning node will be awarded the opportunity to adjust its weight
• Also, we set the learning rate 𝜂 to be 0.5
• Assume that the weights have been randomly initialized as follows (these values are
reconstructed from the distances used later in the example): node 1, (w11, w21) =
(0.9, 0.8); node 2, (w12, w22) = (0.9, 0.2); node 3, (w13, w23) = (0.1, 0.8);
node 4, (w14, w24) = (0.1, 0.2)
• For the first input vector, x1 = (0.8, 0.8), we perform the following competition,
cooperation, and adaptation sequence
• Competition step: Compute the Euclidean distances between the input and the
weights for each of the four output nodes:
D(w1, x1) = √((0.9 − 0.8)² + (0.8 − 0.8)²) = 0.10, D(w2, x1) ≈ 0.61,
D(w3, x1) = 0.70, and D(w4, x1) ≈ 0.92
• Node 1 is the winner because its distance is smallest: the input vector (0.8,
0.8) is most similar to its weight vector (0.9, 0.8)
• In other words, we may expect node 1 to uncover a cluster of older, high-income
persons
A Worked Example
• Cooperation. In this simple example, we have set the neighborhood size R = 0 so that
the level of cooperation among output nodes is nil! Therefore, only the winning
node, node 1, will be rewarded with a weight adjustment. (We omit this step in the
remainder of the example.)
• Adaptation. For the winning node, node 1, the weights are adjusted as follows: For j
= 1 (node 1), n = 1 (the first record), and learning rate 𝜂 = 0.5, for each of the fields
this becomes
• For age: w11,new = w11,current + 𝜂(x11 − w11,current) = 0.9 + 0.5(0.8 − 0.9) = 0.85
• For income: w21,new = w21,current + 𝜂(x12 − w21,current) = 0.8 + 0.5(0.8 − 0.8) = 0.80
• Note the type of adjustment that takes place. The weights are nudged in the
direction of the fields’ values of the input record. That is, w11, the weight on the age
connection for the winning node, was originally 0.9, but was adjusted in the
direction of the normalized value for age in the first record, 0.8. As the learning rate
𝜂 = 0.5, this adjustment is half (0.5) of the distance between the current weight and
the field value
• This adjustment will help node 1 to become even more proficient at capturing the
records of older, high-income persons
A Worked Example
• Let's apply this process to three more input records, x2 = (0.8, 0.1), x3 = (0.2, 0.9),
and x4 = (0.1, 0.1), with the weight updates from each record carrying forward to
the next
• For input 2, the distances are computed as 0.70, 0.14, 0.99, and 0.71, so output
node 2 is the winner; its updated weights are 0.85 and 0.15
• For input 3, the distances are computed as 0.66, 0.99, 0.14, and 0.71, so output
node 3 is the winner; its updated weights are 0.15 and 0.85
• For input 4, the distances are computed as 1.03, 0.75, 0.75, and 0.10, so output
node 4 is the winner; its updated weights are 0.10 and 0.15
• The consolidated outcome is as follows
Cluster   Associated with   Description
1         Node 1            Older person with high income
2         Node 2            Older person with low income
3         Node 3            Younger person with high income
4         Node 4            Younger person with low income
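The whole four-record sequence can be replayed in code. This sketch uses the initial weights reconstructed above and reproduces the distances and updated weights quoted in the example:

```python
import numpy as np

# Initial (age, income) weight vectors for the four output nodes
W = np.array([[0.9, 0.8],    # node 1
              [0.9, 0.2],    # node 2
              [0.1, 0.8],    # node 3
              [0.1, 0.2]])   # node 4
eta = 0.5                    # learning rate
records = np.array([[0.8, 0.8], [0.8, 0.1], [0.2, 0.9], [0.1, 0.1]])

for n, x in enumerate(records, start=1):
    d = np.linalg.norm(W - x, axis=1)      # competition
    j = int(np.argmin(d))                  # winning node (R = 0: no neighbors)
    W[j] += eta * (x - W[j])               # adaptation of the winner only
    print(f"record {n}: distances {np.round(d, 2)}, "
          f"winner node {j + 1}, new weights {np.round(W[j], 2)}")
```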
A Bigger Example
• Let’s apply the Kohonen Net to the Churn dataset
• Recall that the data set contains 20 variables worth of information about 3333
customers, along with an indication of whether that customer churned (left the
company) or not
• A total of nine variables/attributes are used for clustering the customers – two
categorical (International Plan and VoiceMail Plan) and seven continuous (account
length, voice mail messages, day minutes, evening minutes, night minutes,
international minutes, and customer service calls)
• Z-score standardization was used to standardize the continuous variables
• A 3 × 3 Kohonen network architecture was adopted
• The learning parameters were set as follows (see the sketch after this list):
• For the first 20 cycles (passes through the data set), the neighborhood size was
set at R = 2, and the learning rate was set to decay linearly starting at 𝜂 = 0.3
• Then, for the next 150 cycles, the neighborhood size was reset to R = 1, while
the learning rate was allowed to decay linearly from 𝜂 = 0.3 to 𝜂 = 0
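The two-phase schedule can be written down directly. This is a sketch: the text does not state the endpoint of the first phase's decay, so a linear decay toward 0 over the 20 cycles is assumed before 𝜂 resets to 0.3 for the second phase:

```python
def schedule(cycle):
    """Neighborhood size R and learning rate eta for a given training
    cycle, following the two-phase setup described above."""
    if cycle < 20:                                   # first 20 cycles
        return 2, 0.3 * (1 - cycle / 20)             # R = 2, eta decays from 0.3
    # next 150 cycles: R = 1, eta decays linearly from 0.3 to 0
    return 1, max(0.0, 0.3 * (1 - (cycle - 20) / 150))

# R and eta at a few checkpoints across the 170 cycles
print([(c, *schedule(c)) for c in (0, 10, 20, 95, 169)])
```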
A Bigger Example
• The net architecture is given below :
A Bigger Example
• As it turned out, the Kohonen algorithm used only six of the nine available output
nodes, as shown in the following figure with output nodes 01, 11, and 21 being
pruned
A Bigger Example - Interpretation
• Can we develop the Cluster profiles?
• Consider International Plan = Yes vs. International Plan = No
• International Plan = Yes customers reside exclusively in Nodes 12 and 22
A Bigger Example - Interpretation
• Can we develop the Cluster profiles?
• Consider VoiceMail Plan = Yes vs. VoiceMail Plan = No
• The three clusters along the bottom row (i.e., Cluster 00, Cluster 10, and Cluster 20)
contain only non-adopters of the VoiceMail Plan
• Clusters 02 and 12 contain only adopters of the VoiceMail Plan
• Cluster 22 contains mostly non-adopters along with some adopters of the VoiceMail Plan
A Bigger Example - Interpretation
• Recall that because of the neighborliness parameter, clusters that are closer
together should be more similar than clusters that are farther apart.
• Note that all International Plan adopters reside in contiguous (neighboring) clusters,
as do all non-adopters. The same holds for the VoiceMail Plan figure, except that
Cluster 22 contains a mixture
• We see that Cluster 12 represents a special subset of customers: those who have
adopted both the International Plan and the VoiceMail Plan. Even though this
subset represents only 2.4% of the customers, it is a well-defined one, and the
Kohonen network uncovered it
• Next we interpret the clusters according to the continuous predictors
A Bigger Example - Interpretation
• The following figure provides information about how the values of all the variables
are distributed among the clusters, with one column per cluster and one row per
variable
A Bigger Example - Interpretation
• The darker rows indicate the more important variables, that is, the variables that
proved more useful for discriminating among the clusters
• Consider Account Length_Z. Cluster 00 contains customers who tend to have been
with the company for a long time, that is, their account lengths tend to be on the
large side. Contrast this with Cluster 20, whose customers tend to be fairly new
• For the continuous variables, the data analyst should report the means for each
variable, for each cluster, along with an assessment of whether the difference in
means across the clusters is significant
• It is important that the means reported to the client appear on the original
(untransformed) scale, and not on the Z scale or min–max scale, so that the client
may better understand the clusters
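The back-transformation from the z scale is a one-liner. A minimal sketch follows; the overall mean and standard deviation shown are hypothetical placeholders, not values from the case study:

```python
import numpy as np

def destandardize(z_means, original_mean, original_sd):
    """Convert cluster means from the z scale back to the original scale
    via x = z * sd + mean, so the client sees untransformed units."""
    return np.asarray(z_means) * original_sd + original_mean

# Hypothetical z-scale cluster means for one variable (illustrative only)
z_cluster_means = [1.0, 0.1, -1.0]
print(destandardize(z_cluster_means, original_mean=100.0, original_sd=40.0))
```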
A Bigger Example - Interpretation
• The following table provides these means, along with the results of an analysis of
variance for assessing whether the difference in means across clusters is significant
A Bigger Example - Interpretation
• Each row contains the information for one numerical variable, with one analysis of
variance for each row. Each cell contains the cluster mean, standard deviation,
standard error (standard deviation∕√cluster count), and cluster count
• The degrees of freedom are df1 = k - 1 = 6 - 1 = 5 and df2 = N - k = 3333 - 6 = 3327
• The F-test statistic is the value of F = MSTR∕MSE for the analysis of variance for that
particular variable, and the Importance statistic is simply 1 − p-value, where
p-value = P(F > F-test statistic); a computational sketch follows this list
• Account length and the number of voice mail messages emerge as the two most important
numerical variables for discriminating among clusters
• Next, it is seen graphically that the account length for Cluster 00 is greater than that
of Cluster 20, which is supported by the statistics in the table above: a mean
account length of 141.508 days for Cluster 00 versus 61.707 days for Cluster 20
• Also, tiny Cluster 12 has the highest mean number of voice mail messages (31.662),
with Cluster 02 also having a large amount (29.229)
• Finally, note that the neighborliness of Kohonen clusters tends to make neighboring
clusters similar. It would have been surprising, for example, to find a cluster with a
141.508 mean account length right next to a cluster with a 61.707 mean account
length. In fact, this did not happen
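As a sketch of how the F statistic and the Importance measure could be computed for one variable, here is a one-way ANOVA using scipy.stats.f_oneway; the three cluster groups below are synthetic, generated only to make the example runnable:

```python
import numpy as np
from scipy import stats

def anova_importance(groups):
    """One-way ANOVA across cluster groups for a single variable.
    Returns F = MSTR/MSE, p-value = P(F > F_stat), and Importance = 1 - p."""
    f_stat, p_value = stats.f_oneway(*groups)
    return f_stat, p_value, 1 - p_value

# Synthetic account-length values for three hypothetical clusters
rng = np.random.default_rng(1)
groups = [rng.normal(141.5, 40, 100),   # long-tenured, like Cluster 00
          rng.normal(100.0, 40, 100),
          rng.normal(61.7, 40, 100)]    # newest, like Cluster 20
print(anova_importance(groups))
```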
A Bigger Example – Cluster Profiles
• Cluster 00: Loyal Non-Adopters. Belonging to neither the VoiceMail Plan nor the
International Plan, customers in large Cluster 00 have nevertheless been with the
company the longest, with by far the largest mean account length, which may be
related to the largest number of calls to customer service. This cluster exhibits the
lowest average minutes usage for day minutes and international minutes, and the
second lowest evening minutes and night minutes
• Cluster 02: Voice Mail Users. This large cluster contains members of the VoiceMail
Plan, with therefore a high mean number of voice mail messages, and no members
of the International Plan. Otherwise, the cluster tends toward the middle of the
pack for the other variables
• Cluster 10: Average Customers. Customers in this medium-sized cluster belong to
neither the Voice Mail Plan nor the International Plan. Except for the second-
largest mean number of calls to customer service, this cluster otherwise tends
toward the average values for the other variables
• Cluster 12: Power Customers. This smallest cluster contains customers who belong
to both the VoiceMail Plan and the International Plan. These sophisticated
customers also lead the pack in usage minutes across three categories and are in
second place in the other category. They also have the fewest average calls to
customer service. The company should keep a watchful eye on this cluster, as they
may represent a highly profitable group.
A Bigger Example – Cluster Profiles
• Cluster 20: Newbie Non-Adopters. Belonging to neither the VoiceMail
Plan nor the International Plan, customers in large Cluster 20 represent the
company’s newest customers, on average, with easily the shortest mean account
length. These customers set the pace with the highest mean night minutes
usage
• Cluster 22: International Plan Users. This small cluster contains members
of the International Plan and only a few members of the VoiceMail Plan.
The number of calls to customer service is second lowest, which may
mean that they need a minimum of hand-holding. Besides the lowest mean
night minutes usage, this cluster tends toward average values for the other
variables.
• Cluster profiles may of themselves be of actionable benefit to companies and
researchers; for example, they may suggest market segmentation strategies in an
era of shrinking budgets
• Rather than targeting the entire customer base for a mass mailing, for example,
perhaps only the most profitable customers may be targeted
• Another strategy is to identify those customers whose potential loss would be of
greater harm to the company, such as the customers in Cluster 12 above
• Finally, customer clusters could be identified that exhibit behavior predictive of
churning; intervention with these customers could save them for the company