N. R. Institute of Business Management (NRIBM-PGDM)
Project
On
Perceptual Mapping, Factor Analysis, Cluster Analysis and Conjoint Analysis
Advanced Marketing Research
Submitted to
Prof. Jaineel Shah
Submitted by
Term V
Batch -2018-20
Sr No. Name Roll No.
1 Sachin Dubey P 1817
2 Jeena Patel P 1840
3 Vatsal Patel P 1846
4 Shail Rami P 1849
5 Viren Trivedi P 1868
[Perceptual map chart: Red Bull, Monster, Combu, Sting, Cloud 9 and Xtra Power plotted on Sweetness vs. Caffeine axes]
Perceptual Mapping

Brands       Sweetness (1-5)   Caffeine (1-5)   Market Share
Red Bull            2                 4              20%
Monster             5                 4              15%
Combu               5                 1               5%
Sting               4                 2              10%
Cloud 9             3                 3               1%
Xtra Power          1                 4              17%
Interpretation:
 From the chart above we can see that Red Bull holds the largest market share (20%) despite being low on sweetness, so it can be called the market leader in the energy-drink segment.
 It is followed by market challengers Monster, Sting and Xtra Power with 15%, 10% and 17% respectively, which score well on taste and also contain caffeine.
 Since 1 means very low and 5 means very high on each attribute, we can conclude that consumers give more weight to the caffeine factor than to the taste factor.
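The perceptual map above can be reproduced directly from the table. A minimal matplotlib sketch follows; the brand coordinates come from the table, while scaling the marker size by market share is an illustrative choice, not part of the original chart.

```python
import matplotlib.pyplot as plt

# Brand ratings from the table above: (sweetness 1-5, caffeine 1-5, market share %)
brands = {
    "Red Bull":   (2, 4, 20),
    "Monster":    (5, 4, 15),
    "Combu":      (5, 1, 5),
    "Sting":      (4, 2, 10),
    "Cloud 9":    (3, 3, 1),
    "Xtra Power": (1, 4, 17),
}

fig, ax = plt.subplots()
for name, (sweetness, caffeine, share) in brands.items():
    # Marker area scaled by market share so the leaders stand out (illustrative choice)
    ax.scatter(sweetness, caffeine, s=share * 30, alpha=0.6)
    ax.annotate(name, (sweetness, caffeine), textcoords="offset points", xytext=(5, 5))

ax.set_xlabel("Sweetness (1-5)")
ax.set_ylabel("Caffeine (1-5)")
ax.set_title("Perceptual map: energy-drink brands")
ax.set_xlim(0, 6)
ax.set_ylim(0, 6)
plt.show()
```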
Factor Analysis
Interpretation:
 The KMO & Bartlett's table shows a significance of 0.000 for Bartlett's test of sphericity and a KMO measure of sampling adequacy of 0.934, which means both tests are passed and the data are suitable for factor analysis.
 This indicates that the underlying survey measured the constructs reliably.
 The total variance explained table shows a cumulative percentage of 58.478%, which is above the 50% threshold, so the extracted factors account for an acceptable share of the variance.
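The KMO (0.934) and Bartlett significance (0.000) figures quoted above would typically come from SPSS output. For reference, the same statistics can be computed in Python with the factor_analyzer package; this is a sketch in which survey_ratings.csv is a placeholder file name, since the report does not include the underlying ratings data.

```python
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Placeholder: rows = respondents, columns = rating items. Load your own data here.
survey_df = pd.read_csv("survey_ratings.csv")

# Bartlett's test of sphericity: a p-value near 0 means the correlation matrix
# is not an identity matrix, so factor analysis is justified.
chi_square, p_value = calculate_bartlett_sphericity(survey_df)

# KMO measure of sampling adequacy: values above ~0.9 (here 0.934) are excellent.
kmo_per_item, kmo_total = calculate_kmo(survey_df)

print(f"Bartlett chi-square = {chi_square:.2f}, sig. = {p_value:.3f}")
print(f"KMO overall = {kmo_total:.3f}")
```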
Cluster Analysis
Clustering analysis: A case study of the
environmental data of RAMA-Toluca
Climatic analysis has recently been studied widely with artificial intelligence tools. The importance of this topic stems from the environmental impact produced by natural variations of the data in a given ecosystem. In this paper, a first study of the meteorological parameters obtained with the Automatic Network of Atmospheric Monitoring (RAMA, by its abbreviation in Spanish) of Toluca, Mexico, is presented. The study period runs from 2001 to 2008. RAMA-Toluca comprises seven monitoring stations located in the Toluca Valley. Using clustering algorithms, the experimental results establish a basis for determining how the days are distributed among clusters, which could be compared with the natural grouping of days into climatic seasons. However, the results show a different situation than the expected one. With this, the bases for future work in the context of climatic analysis of the Toluca Valley are laid.
INTRODUCTION
Environmental data analysis is a topic that has gained importance in the scientific community over the last decades. Within this scope, climatic change is the main environmental problem. Its impact has been foreseeable on hydric resources, productive ecosystems, biodiversity, infrastructure, public health and, generally, on the diverse components involved in the development process (Staines, 2007), threatening a healthy environment and the quality of life.
In the State of Mexico, particularly in the metropolitan zone of the Toluca Valley (MZTV), one can see that as rural areas have been overtaken by industrialization, driven by the continuous process of urbanization, natural resources are devastated and several environmental problems arise: misuse of the land and reduction of the agricultural and forest border, invasion of protected natural areas, deforestation, erosion processes, forest fires, open-air burning of residues, and pollution emissions from industries and damaged vehicles.
For this reason, several artificial intelligence (AI) techniques are proposed to discover and track patterns in the climate parameters of the MZTV. AI is a discipline concerned with developing software and hardware that can emulate human actions, for example manipulating knowledge, drawing conclusions, explaining human reasoning and behaving as if it were human.
Clustering is the generic name of a great variety of techniques useful for finding non-obvious knowledge in large data sets (Kotsiantis and Pintelas, 2004). There are two groups of techniques: non-hierarchical (partitional) techniques and hierarchical techniques. The first separates the data set into k groups, while the second forms a set of several differentiation levels (MacKay, 2003). Different methods are available for determining the quality of clusters (Bolshakova et al., 2005). These methods apply numerical measures to the clustering results, inferring the quality and describing the situation of a given pattern inside its cluster.
Several studies have addressed this problem. Among those in which climatic change was studied are Secretaria del medio ambiente (2007) and Parra-Olea et al. (2005). In general, the proposals consider a regional study of climatic changes (Travasso et al., 2008) in order to project the global predictions of the available climate models regionally and to identify the effects of these changes (Gutierrez and Pons, 2006; Tebaldi and Knutti, 2009). On the other hand, several researchers use either data mining (Steinbach et al., 2002; Atem et al., 2004) or clustering methods in different ways: for discovering ecosystem patterns (Steinbach et al., 2001; Kumar et al., 2001) and for improving algorithm behavior (Gutiérrez and Rodríguez, 2004), with proposals such as a weighted clustering method for analyzing infrequent patterns, or extreme events, in weather forecasts.
Building on the work cited above, the object of this study was to analyze and discover the information contained in the databases provided by RAMA-Toluca. In particular, we analyzed the meteorological variables using clustering algorithms to identify the grouping in each year of the studied period (2001-2008). That is to say, we can learn the distribution of the days among the groups and, in consequence, the seasons identified by the clustering methods. The paper is organized as follows: the clustering methods used in the study are presented, followed by a description of the cluster validation algorithms that allow a corroboration of group quality. Then the study zone and the meteorological parameters evaluated are described in detail, after which the experimental results are shown. Finally, the concluding remarks and the open lines of study are given.
CLUSTERING METHODS
The clustering process consists of dividing the data set into groups of similar objects. For measuring the similarity between objects, different distance measures are commonly used, as described later in this work.
Adaptive algorithm
The adaptive algorithm (AA) is an incremental heuristic method which uses two parameters: a distance threshold t for creating groups and a fraction T which determines the total confidence σ. The main function of the algorithm is to create groups based on t (weighted by σ); the first group, however, is established arbitrarily. The main steps of the AA are the following (Bow, 1992):
(i) The first group is determined arbitrarily.
(ii) Each time a sample is assigned to a group, the cluster center is recalculated. This recalculation may cause some samples to change cluster.
(iii) Because of the iterative process, samples of a given cluster may migrate between clusters.
(iv) The algorithm ends when there are no reassignments. At this point the partition is considered stable.
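The steps above map onto a short Python sketch. The excerpt does not fully specify how σ weights the threshold, so the sketch treats σ · t as the effective distance threshold; that weighting, and the use of Euclidean distance, are assumptions.

```python
import numpy as np

def adaptive_clustering(samples, t, sigma=1.0, max_iter=100):
    """Leader-style incremental clustering: a sample joins the nearest existing
    cluster if it lies within the effective threshold sigma * t, otherwise it
    seeds a new cluster. Centers are recalculated until the partition is stable."""
    threshold = sigma * t                        # assumption: sigma weights the raw threshold t
    centers = [samples[0].astype(float)]         # (i) the first group is set arbitrarily
    labels = np.zeros(len(samples), dtype=int)

    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(samples):
            dists = [np.linalg.norm(x - c) for c in centers]
            nearest = int(np.argmin(dists))
            if dists[nearest] <= threshold:
                new_label = nearest
            else:
                centers.append(x.astype(float))  # open a new group
                new_label = len(centers) - 1
            if new_label != labels[i]:           # (iii) samples may change cluster
                labels[i] = new_label
                changed = True
        for k in range(len(centers)):            # (ii) recalculate each cluster center
            members = samples[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
        if not changed:                          # (iv) no reassignments: partition is stable
            break
    return labels, centers

labels, centers = adaptive_clustering(
    np.random.default_rng(0).normal(size=(100, 6)), t=3.0)
print(len(centers), "clusters found")
```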
K-means algorithm
K-means is a partitional algorithm: similar samples end up in the same cluster and dissimilar samples in different clusters (MacKay, 2003). The algorithm requires a single parameter, k, which defines the number of groups to be found in the data set. K-means uses an iterative process that starts by defining a sample prototype (centroid) as the representative of each cluster, computed as the average of its samples. Next, each sample is assigned to the closest centroid using a metric, commonly the Euclidean distance. The centroids are then recalculated from the newly formed groups. This process continues until a stopping criterion is met, for example a number of epochs or no more reassignments (Garre et al., 2007).
The algorithm is fast and efficient; nevertheless, it has several limitations, such as requiring a-priori knowledge of the number of clusters in the data set.
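The excerpt does not say which implementation the authors used; scikit-learn's KMeans illustrates the iterative process just described. The data here are random placeholders standing in for the RAMA records, assumed to have one row per day and one column per meteorological variable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data: one row per day, columns = meteorological variables
# (TMP, HR, PA, RS, VV, DV); random values stand in for the RAMA records.
rng = np.random.default_rng(0)
X = rng.normal(size=(365, 6))

# k must be chosen a priori -- the limitation noted above.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)          # assign each day to its nearest centroid
centroids = kmeans.cluster_centers_     # recalculated averages of each group

print(np.bincount(labels))              # number of days per cluster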
Validation algorithms
Cluster validation consists of evaluating the clustering result in order to find the partition that best fits the data (Halkidi et al., 2001). Once the conglomerates have been created, their quality must be verified through validation algorithms (Bolshakova et al., 2005).
Cohesion
Cohesion can be defined as the sum of the proximities of a cluster's samples with respect to its prototype (centroid) (Bolshakova et al., 2005). The cohesion of cluster Ci is given by:

Cohesion(Ci) = Σ_{x ∈ Ci} proximity(x, ci)

where x is a sample contained in cluster Ci, ci is the centroid of cluster Ci, and proximity is the squared Euclidean distance.
Separation
The separation between two clusters can be measured by the proximity of their prototypes (centroids). Measured against the overall prototype, it is given by:

Separation(Ci) = proximity(ci, c)

where ci is the centroid of cluster Ci, c is the general prototype (overall centroid), and proximity can be any metric (Tan et al., 2006).
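The two equations above translate directly into a few lines of numpy. This sketch uses squared Euclidean distance for cohesion, as the text specifies, and the same metric for separation; the two-cluster example data are made up for illustration.

```python
import numpy as np

def cohesion(cluster_points, centroid):
    """Sum of squared Euclidean distances from each sample to its centroid."""
    diffs = cluster_points - centroid
    return float(np.sum(diffs ** 2))

def separation(centroid, global_centroid):
    """Proximity between a cluster prototype and the overall prototype;
    squared Euclidean distance is used here, but any metric would do."""
    return float(np.sum((centroid - global_centroid) ** 2))

# Example with two small, well-separated clusters
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.2, 7.9]])
labels = np.array([0, 0, 1, 1])
global_c = X.mean(axis=0)
for k in (0, 1):
    pts = X[labels == k]
    c = pts.mean(axis=0)
    print(k, cohesion(pts, c), separation(c, global_c))
```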
Table 1. Number of samples per monitoring station and year.

Station   2001   2002   2003   2004   2005   2006   2007   2008   Total
CE        8627   8746   8528   8567   8104   6910   5546   2661   57689
SL        8171   8639   8692   8549   8441   7973   4087   7761   62313
SM        8213   8480   8478   8535   7909   8316   7828    595   58354
Silhouette coefficient
This method combines the two previous measures, cohesion and separation. The following steps define the coefficient for a single object (Halkidi et al., 2001):
i.) For the i-th object, the average distance to all other objects in the same cluster is computed; this value is called ai.
ii.) For the i-th object and each cluster not containing it, the average distance to all objects in that cluster is computed. The minimum of these values over all such clusters is called bi.
iii.) For the i-th object, the silhouette coefficient is si = (bi - ai) / max(ai, bi), where max(ai, bi) is the larger of ai and bi.
The silhouette coefficient varies between -1 and 1, reaching its maximum value of 1 when ai = 0. A negative value is undesirable, since it corresponds to the case where ai, the average distance to points within the same cluster, is greater than bi, the minimum average distance to points of the other clusters. The desired result is a positive silhouette coefficient (ai < bi) with ai close to 0.
To calculate the silhouette coefficient average of one cluster, we take the average of the coefficients of all the points inside the cluster. A general measure for a whole clustering can be obtained by averaging the silhouette coefficients of all the points (Tan et al., 2006).
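Scikit-learn implements exactly this coefficient, both per object and averaged over the whole clustering. A short check, under the same placeholder-data assumption as before:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(1)
X = rng.normal(size=(365, 6))  # placeholder for one year of daily records

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    s_i = silhouette_samples(X, labels)   # per-object si = (bi - ai) / max(ai, bi)
    print(k, silhouette_score(X, labels), s_i.min(), s_i.max())
```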
Study zone
In the MZTV, the air quality has been measured since 1993 with seven monitoring stations covering seven municipalities in three zones, as shown in Figure 1. The monitoring stations store environmental data. For this research, the meteorological variables studied are: TMP (temperature), HR (relative humidity), PA (atmospheric pressure), RS (solar radiation), VV (wind speed) and DV (wind direction).
The data present several problems related to the monitoring stations which complicate their study, some of which are: sensor faults or abrupt sensor movements provoked by the wind or other causes; meteorological values inconsistent with reality (for example, a temperature of 80°C in winter); and values that were not captured at all (lost data). When the RAMA administrator identifies one of these problems, the record is marked for later consideration.
The solution adopted was to replace each missing or marked value with the average of the last and next real readings of that feature, which yields a value within the realistic range. On the other hand, when a sample loses more than 50% of its information, it is considered noise and, as such, is eliminated.
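The gap-filling rule just described (drop rows missing more than half their fields, then replace each remaining gap with the average of the previous and next valid readings) maps naturally onto pandas. The column names follow the variables listed earlier; the sample values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Assumed layout: one row per record, columns as listed in the text.
cols = ["TMP", "HR", "PA", "RS", "VV", "DV"]
df = pd.DataFrame(
    [[20.1, 55, 720, 300, 2.1, 180],
     [np.nan, 56, np.nan, 310, 2.3, 175],
     [21.0, np.nan, 722, np.nan, np.nan, np.nan],   # >50% missing -> treated as noise
     [22.4, 58, 723, 330, 2.0, 170]],
    columns=cols,
)

# Drop samples that lose more than 50% of their information.
df = df[df.isna().mean(axis=1) <= 0.5]

# Replace each remaining gap with the average of the previous and next
# real values -- linear interpolation between neighbours does exactly that.
df = df.interpolate(method="linear", limit_direction="both")
print(df)
```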
The data used for the study were provided by the 3 monitoring stations that showed the most favorable characteristics, namely a large number of records and little loss of information: Toluca Center (CE), San Lorenzo Tepatitlan (SL) and San Mateo Atenco (SM). Table 1 shows the number of samples per station and year.
[Figure panels: a) CE station, b) SM station, c) SL station]
EXPERIMENTAL RESULTS
Here we present the results of the clustering algorithms applied to the database provided by RAMA-Toluca. First the data were filtered; then the two clustering algorithms, k-means and the adaptive algorithm, were applied.
The specifications for the k-means algorithm were as follows: the initial seed was chosen randomly and k = 2, 3, 4 for each database (Table 1). For the adaptive algorithm, different thresholds were applied per database: 100-150, 150-200, 200-250, 250-300, 300-350, 350-400, 400-450 and 450-500 for the threshold and T-value, respectively. Figure 2 shows how the samples were grouped and how many groups were formed. The validation of conglomerate quality, using the silhouette coefficient, is displayed in Figure 3. In Figures 2 and 3 it can be observed that as the number of clusters decreases, the quality increases, indicating that the best clustering is obtained when the algorithm finds two clusters. Table 2 lists the samples grouped in each clustering result for the two-cluster case.
The figures reflect the convergence of the two clustering algorithms when both settle on two groups. Regarding the group sizes, one of the groups is almost twice as large as the other.
[Figure panels: CE station, SM station, SL station]
Conclusions
Throughout the year, the climate changes with the seasons. The working hypothesis establishes that there are four seasons in one year; for this reason, we expected to find four conglomerates in the data set provided by RAMA-Toluca, owing to the similarities between the samples of each season. Nevertheless, the analysis presented here shows that, for the meteorological data analyzed, the clustering algorithms found only two large groups. The silhouette coefficient, cohesion and separation were used to validate the quality of the clusters.
The preliminary results presented here could indicate that, in the meteorological data studied, the samples of each year share similar features corresponding mainly to two seasons. In addition, given the insignificant differences between the conglomerates, it is possible to suppose that any climatic variation may have happened before the year 2001.
With these results it is possible to lay the bases for future work on this important topic, but several questions need an answer: Is the number of clusters equal to the number of seasons? Is this behavior due to climatic change? To answer these questions and obtain a broader analysis, we contacted meteorological experts; we have been in touch with the Environmental Engineering group of the Technological Institute of Toluca and the Environment Secretariat in Toluca to improve the analysis, and we expect to present the new analysis in future work.
Regarding research in progress, we are working with the unsupervised neural network SOM (Tan et al., 2006) to compare several scenarios, for example between 2001 and each of 2002 to 2008. A linear regression and correlation analysis is the subject of another study in progress that should soon be finished.
The open lines of study point to other databases with information covering more years and other states or countries. It is also possible to include the analysis of other years and other climatic databases, as well as to use other algorithms such as ISODATA and DBSCAN (Martín et al., 1996). Likewise, we are analyzing the convenience of including other validation methods and of studying methods for handling lost data.
Conjoint analysis
A Case Study of Behavior-driven Conjoint Analysis
on Yahoo! Front Page Today Module
Since the advent of conjoint methods in marketing research pioneered by Green and Rao [9], research on theoretical methodologies and pragmatic issues has thrived. Conjoint analysis is one of the most popular marketing research methodologies for assessing users' preferences on various objective characteristics of products or services. Analysis of trade-offs, driven by heterogeneous preferences on the benefits derived from product attributes, provides critical input for many marketing decisions, e.g. optimal design of new products, target market selection, and product pricing. It is also an analytical tool for predicting users' plausible reactions to new products or services.
In practice, a set of categorical or quantitative attributes is collected to represent the products or services of interest, while a user's preference on a specific attribute is quantified by a utility function (also called a partworth function). While there exist several ways to specify a conjoint model, additive models that linearly sum up individual partworth functions are the most popular choice.
As a measurement technique for quantifying users’ preferences on product attributes (or
partworths), conjoint analysis always consists of a series of steps, including stimulus
representation, feedback collection and estimation methods. Stimulus representation involves
development of stimuli based on a number of salient attributes (hypothetical profiles or
choice sets) and presentation of stimuli to appropriate respondents. Based on the nature of
users’ response to the stimuli, popular conjoint analysis approaches are either choice-based or
ratings-based. Recent developments of estimation methods comprise hierarchical Bayesian
(HB) methods [15], polyhedral adaptive estimation [19], Support Vector Machines [2, 7] etc.
We summarize three main differences between Web-based conjoint analysis and the
traditional one in the following:
 The Web content may present various stimuli that potentially contain many psychologically related attributes, rather than the predefined attributes of interest in traditional experimental design. Meanwhile, most users are casual or new visitors who declare only part or none of their personal information and interests. Since we have to extract attributes or discover latent features when profiling both content stimuli and users, parameter estimation becomes more challenging than in the traditional situation;
 In feedback collection, most respondents have not experienced strong incentives to expend their cognitive resources on the prominent but unsolicited content. This issue causes a relatively high rate of false negatives;
 The sample size considered in traditional conjoint analysis is usually less than a thousand, whereas in modern e-business applications it is common to observe millions of responses in a short time, e.g. a few hours. Such large-scale data sets make traditional conjoint analysis, coupled with sophisticated Monte Carlo simulation for parameter estimation, computationally prohibitive.
In this paper, we conduct a case study of conjoint analysis on a click-through stream to understand users' intentions. We construct features to represent the Web content, and collect user information across the Yahoo! network. The partworth function is optimized in a tensor regression framework via gradient descent methods on large-scale samples. In the partworth space, we apply clustering algorithms to identify meaningful segments with distinct behavior patterns. These segments yield a significant CTR lift over both the unsegmented baseline and two demographic segmentation methods in offline and online tests on the Yahoo! Front Page Today Module application. By analyzing the characteristics of user segments, we also obtain interesting insights into users' intentions and behavior that could be applied to market campaigns and user targeting. This knowledge could be further utilized to help editors with content management.
PROBLEM SETTING
In this section, we first describe our problem domain and our motivations for this research
work. Then we describe our data set and define some notations.
Today Module
Today Module is the most prominent panel on the Yahoo! Front Page, which is itself one of the most popular pages on the Internet; see the snapshot in Figure 1. The default "Featured" tab in Today Module highlights one of four high-quality articles selected from a daily-refreshed article pool curated by human editors. As illustrated in Figure 1, there are four articles at the footer positions, indexed by F1, F2, F3 and F4 respectively. Each article is represented by a small picture and a title. One of the four articles is highlighted at the story position, where it is featured with a large picture, a title and a short summary along with related links. By default, the article at F1 is highlighted at the story position. A user can click on the highlighted article at the story position to read more details if she is interested in the article; this event is recorded as a "story click". If a user is interested in one of the articles at positions F2-F4, she can highlight it at the story position by clicking on its footer position.
One of our goals is to increase user activity on the Today Module, measured by overall CTR. To draw visitors' attention and increase the number of clicks, we would like to rank the available articles according to visitors' interests and highlight the most attractive article at the F1 position. In our previous research [1] we developed an Estimated Most Popular (EMP) algorithm, which estimates the CTR of available articles in near real-time with a Kalman filter and presents the article with the highest estimated CTR at the F1 position. Note that there is no personalization in that system, i.e. the article shown at F1 is the same for all visitors at a given time. In this work we would like to further boost overall CTR by launching a partially personalized service: user segments determined by conjoint analysis will be served different content according to segmental interests, with the article of the highest segmental CTR served to each segment.
Data Collection
We collected three sets of data: content features, user profiles and interaction data between users and articles. Each article is summarized by a set of features, such as topic categories, sub-topics, URL resources, etc. Each visitor is likewise profiled by a set of attributes, e.g. age, gender, residential location, Yahoo! property usage, etc. Here we simply selected a set of informative attributes to represent users and articles. Gauch et al. [8] give an extensive review of various profiling techniques.
There are multiple ways to treat users' reactions when modelling the partworth utility:
 Choice-based responses: We only consider whether an article has been clicked by a visitor, ignoring repeated views and clicks. In this case an observed response is simply binary: click or not;
 Poisson-based responses: The number of clicks observed on each article/user pair is treated as a realization of a Poisson distribution;
 Metric-based responses: We consider repeated views and clicks and treat the CTR of each article by each user as the target.
In the Today Module setting, Poisson-based and metric-based responses would be vulnerable to the high rate of false-negative observations. Thus we follow the choice-based treatment only in this work.
Notations
Let xi denote the i-th user, a D × 1 vector of user features, and zj the j-th content item, a C × 1 vector of article features. We denote by rij the interaction between user xi and item zj, where rij ∈ {-1, +1} for a "view" event and a "story click" event respectively. We only observe interactions on a small subset of all possible user/article pairs, and denote by O the set of observations {rij}.
TENSOR SEGMENTATION
In this section, we employ logistic tensor regression coupled with efficient gradient-descent methods to estimate the partworth function conjointly on large data sets. In the users' partworth space, we then apply clustering techniques to segment users. Note that we consider cases of millions of users and thousands of articles, where the number of observed interactions between user/article pairs can reach tens of millions.
Tensor Indicator
We first define an indicator as a parametric function of the tensor product of the article features zj and the user attributes xi as follows:

s(xi, zj) = Σ_{a=1..C} Σ_{b=1..D} wab · zj,a · xi,b

where D and C are the dimensionality of the user and content features respectively, zj,a denotes the a-th feature of zj, and xi,b denotes the b-th feature of xi. The weight wab is independent of the user and content features; it represents the affinity of the two features xi,b and zj,a in interactions. In matrix form, the indicator can be rewritten as

s(xi, zj) = xi^T W zj

where W denotes a D × C matrix with entries {wab}. The partworths of user xi on the article attributes are evaluated as W^T xi, denoted x̃i, a vector of the same length as zj. The tensor product above, also known as a bilinear model, can be regarded as a special case of the Tucker family [5], which has been extensively studied in the literature and in applications. For example, Tenenbaum and Freeman [18] developed a bilinear model for separating "style" and "content" in images, and recently Chu and Ghahramani [3] derived a probabilistic framework of the Tucker family for modelling structural dependency in partially observed high-dimensional array data.
Logistic Regression
Conventionally, the tensor indicator is related to an observed binary event through a logistic function. In our particular application we found three additions to be necessary:
User-specific bias: users' activity levels differ widely; some are active clickers while others are casual users. We introduce a bias term for each user, denoted μi for user i;
Article-specific bias: articles differ in popularity, so we include a bias term for each article as well, denoted γj for article j;
Global offset: since the number of click events is much smaller than the number of view events in our observations, the classification problem is heavily imbalanced. We therefore introduce a global offset ι to take this into account.
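A compact numpy sketch of the model just described: the bilinear indicator plus bias terms and global offset, pushed through a logistic link and fit by gradient descent. Dimensions, learning rate and random data are illustrative, and the per-user and per-article biases are collapsed to scalars for brevity; the paper's exact optimization details are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, n_obs = 8, 5, 10_000          # user dims, article dims, observations (illustrative)

X = rng.normal(size=(n_obs, D))     # user feature vectors x_i
Z = rng.normal(size=(n_obs, C))     # article feature vectors z_j
y = rng.choice([-1.0, 1.0], size=n_obs, p=[0.9, 0.1])  # view (-1) vs. story click (+1)

W = np.zeros((D, C))                # bilinear weights {w_ab}
mu = 0.0                            # user bias (per-user in the paper; scalar here)
gamma = 0.0                         # article bias (per-article in the paper; scalar here)
iota = 0.0                          # global offset for the class imbalance

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

lr = 0.1
for step in range(200):
    # s_ij = x_i^T W z_j + mu + gamma + iota for every observation
    s = np.einsum("nd,dc,nc->n", X, W, Z) + mu + gamma + iota
    # Gradient of the negative log-likelihood of the logistic model P(y|s) = sigmoid(y*s)
    g = -y * sigmoid(-y * s)
    W -= lr * np.einsum("n,nd,nc->dc", g, X, Z) / n_obs
    mu -= lr * g.mean()
    gamma -= lr * g.mean()
    iota -= lr * g.mean()

partworths = X @ W                  # each row is x~_i = W^T x_i, the user's partworth vector
```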
Clustering
With the optimal coefficients W in hand, we compute the partworths for each training user as x̃i = W^T xi. The vector x̃i represents the user's preferences on article attributes. In the partworth space spanned by {x̃i}, we further apply a clustering technique, e.g. K-means [13], to group training users with similar preferences into segments. The number of clusters can be determined by validation in offline analysis. For an existing or new user, we can predict her partworths as x̃t = W^T xt, where xt is the vector of user features. Her segment membership is then determined by the shortest distance between her partworth vector and the cluster centroids, i.e.

k(xt) = argmin_k || x̃t - ck ||

where ck is the centroid of the k-th segment.
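With a fitted W, as from a sketch like the one above, segmentation reduces to K-means in the partworth space plus nearest-centroid assignment for new users. Here W and user_features are random stand-ins for the fitted coefficients and training-user features.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 5))                 # stands in for the fitted coefficient matrix
user_features = rng.normal(size=(1000, 8))  # one row of features per training user

partworths = user_features @ W              # x~_i = W^T x_i for every training user

# Cluster users with similar preferences into segments (5 chosen by validation).
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(partworths)

def segment_of(x_new, kmeans=km, weights=W):
    """Assign a (possibly new) user to the segment whose centroid lies
    closest to her predicted partworth vector W^T x."""
    p = x_new @ weights
    dists = np.linalg.norm(kmeans.cluster_centers_ - p, axis=1)
    return int(np.argmin(dists))

print(segment_of(rng.normal(size=8)))
```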
Offline Analysis
For each user in the test set, we first computed her segment membership as above, and sorted all available articles in descending order of their CTR within that segment at the time stamp of the event. On click events, we measured the rank position of the article clicked by the user. The performance metric used in offline analysis is the number of clicks in the top four rank positions.
We varied the number of clusters from 1 to 20, and presented the corresponding results of the
click portion at the top rank position in Figure 2. Note that the CTR estimation within
segments suffers from low-traffic issues when the number of segments is large. We observed
the best validation performance at 8 clusters, but the difference compared with that at 5
clusters is not statistically significant. Thus we selected 5 clusters in our application.
Segment Analysis
We collected characteristics of the 5 segments we discovered. On the September data, we identified cluster membership for all users and plotted the population distribution over the 5 segments as a pie chart in Figure 4. The largest cluster contains 32% of the users, while the smallest contains 10%.
 Cluster c1 consists mostly of female users under age 34;
 Cluster c2 consists mostly of male users under age 44;
 Cluster c3 consists of female users above age 30;
 Cluster c4 consists mainly of male users above age 35;
 Cluster c5 is predominantly non-U.S. users.
We also observed that c1 and c2 contain a small portion of users above age 55, and that c3 has some young female users as well. Cluster membership is thus not solely determined by demographic information, although demographics give a very strong signal; it is users' behavior that reveals their interest in article topics.
CONCLUSIONS
In this study, we executed conjoint analysis on a large-scale click-through stream of the Yahoo! Front Page Today Module. We validated the segments discovered by conjoint analysis through offline and online tests, analyzed the characteristics of users in the segments and found distinct visiting patterns across segments. The segment-level insight into user intention found in this study could be exploited to enhance user engagement on the Today Module by assisting editors with article content management. In this study a user can belong to only one segment; we would like to explore other clustering techniques, such as Gaussian mixture models, which allow multiple memberships, so that a user's preference might be determined by a weighted sum of several segmental preferences. We plan to pursue this direction in the future.

More Related Content

What's hot (20)

Strategic marketing ppt @ mba
Strategic marketing ppt @ mbaStrategic marketing ppt @ mba
Strategic marketing ppt @ mba
 
consumer research process
consumer research processconsumer research process
consumer research process
 
Chapter 2 strategic marketing
Chapter 2   strategic marketingChapter 2   strategic marketing
Chapter 2 strategic marketing
 
Chapter 2 Developing Marketing Strategies and Plans
Chapter 2 Developing Marketing Strategies and PlansChapter 2 Developing Marketing Strategies and Plans
Chapter 2 Developing Marketing Strategies and Plans
 
ANOVA in Marketing Research
ANOVA  in Marketing ResearchANOVA  in Marketing Research
ANOVA in Marketing Research
 
Chapter005
Chapter005Chapter005
Chapter005
 
Brand Valuation
Brand ValuationBrand Valuation
Brand Valuation
 
introduction to internet marketing
introduction to internet marketingintroduction to internet marketing
introduction to internet marketing
 
BCG MATRIX AND GE MATRIX
BCG MATRIX AND GE MATRIXBCG MATRIX AND GE MATRIX
BCG MATRIX AND GE MATRIX
 
Information search & evaluation
Information search & evaluationInformation search & evaluation
Information search & evaluation
 
Bcg and ge matrix ppt
Bcg and ge matrix pptBcg and ge matrix ppt
Bcg and ge matrix ppt
 
Segmentation, Targeting, and Positioning
Segmentation, Targeting, and PositioningSegmentation, Targeting, and Positioning
Segmentation, Targeting, and Positioning
 
Customer value and Satisfaction
Customer value and SatisfactionCustomer value and Satisfaction
Customer value and Satisfaction
 
Strategic control & its types
Strategic control & its typesStrategic control & its types
Strategic control & its types
 
Ge matrix
Ge matrixGe matrix
Ge matrix
 
Managing channel conflict
Managing channel conflict Managing channel conflict
Managing channel conflict
 
strategic control
strategic controlstrategic control
strategic control
 
Cross sectional design
Cross sectional designCross sectional design
Cross sectional design
 
Chap3
Chap3Chap3
Chap3
 
Consumer Involvement 1
Consumer Involvement 1Consumer Involvement 1
Consumer Involvement 1
 

Similar to Perceuptal mapping, Factor analysis, cluster analysis, conjoint analysis

Classification accuracy analyses using Shannon’s Entropy
Classification accuracy analyses using Shannon’s EntropyClassification accuracy analyses using Shannon’s Entropy
Classification accuracy analyses using Shannon’s EntropyIJERA Editor
 
Classification of medical datasets using back propagation neural network powe...
Classification of medical datasets using back propagation neural network powe...Classification of medical datasets using back propagation neural network powe...
Classification of medical datasets using back propagation neural network powe...IJECEIAES
 
Two-Stage Eagle Strategy with Differential Evolution
Two-Stage Eagle Strategy with Differential EvolutionTwo-Stage Eagle Strategy with Differential Evolution
Two-Stage Eagle Strategy with Differential EvolutionXin-She Yang
 
An Automatic Clustering Technique for Optimal Clusters
An Automatic Clustering Technique for Optimal ClustersAn Automatic Clustering Technique for Optimal Clusters
An Automatic Clustering Technique for Optimal ClustersIJCSEA Journal
 
Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...Alexander Decker
 
Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...Alexander Decker
 
Classification With Ant Colony
Classification With Ant ColonyClassification With Ant Colony
Classification With Ant ColonyGissely Souza
 
Bat Algorithm is Better Than Intermittent Search Strategy
Bat Algorithm is Better Than Intermittent Search StrategyBat Algorithm is Better Than Intermittent Search Strategy
Bat Algorithm is Better Than Intermittent Search StrategyXin-She Yang
 
Data science in chemical manufacturing
Data science in chemical manufacturingData science in chemical manufacturing
Data science in chemical manufacturingKarthik Venkataraman
 
Clustering and Classification in Support of Climatology to mine Weather Data ...
Clustering and Classification in Support of Climatology to mine Weather Data ...Clustering and Classification in Support of Climatology to mine Weather Data ...
Clustering and Classification in Support of Climatology to mine Weather Data ...MangaiK4
 
Ontology based clustering algorithms
Ontology based clustering algorithmsOntology based clustering algorithms
Ontology based clustering algorithmsIkutwa
 
Trabajo de ingles (5)
Trabajo de ingles (5)Trabajo de ingles (5)
Trabajo de ingles (5)sasmaripo
 
Satellite imageclassification
Satellite imageclassificationSatellite imageclassification
Satellite imageclassificationMariam Musavi
 
Investigations of certain estimators for modeling panel data under violations...
Investigations of certain estimators for modeling panel data under violations...Investigations of certain estimators for modeling panel data under violations...
Investigations of certain estimators for modeling panel data under violations...Alexander Decker
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysiss v
 
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATAA HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATAgerogepatton
 

Similar to Perceuptal mapping, Factor analysis, cluster analysis, conjoint analysis (20)

Ijetr021251
Ijetr021251Ijetr021251
Ijetr021251
 
C054
C054C054
C054
 
Classification accuracy analyses using Shannon’s Entropy
Classification accuracy analyses using Shannon’s EntropyClassification accuracy analyses using Shannon’s Entropy
Classification accuracy analyses using Shannon’s Entropy
 
Classification of medical datasets using back propagation neural network powe...
Classification of medical datasets using back propagation neural network powe...Classification of medical datasets using back propagation neural network powe...
Classification of medical datasets using back propagation neural network powe...
 
Two-Stage Eagle Strategy with Differential Evolution
Two-Stage Eagle Strategy with Differential EvolutionTwo-Stage Eagle Strategy with Differential Evolution
Two-Stage Eagle Strategy with Differential Evolution
 
An Automatic Clustering Technique for Optimal Clusters
An Automatic Clustering Technique for Optimal ClustersAn Automatic Clustering Technique for Optimal Clusters
An Automatic Clustering Technique for Optimal Clusters
 
Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...
 
Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...
 
Paper1
Paper1Paper1
Paper1
 
Classification With Ant Colony
Classification With Ant ColonyClassification With Ant Colony
Classification With Ant Colony
 
Bat Algorithm is Better Than Intermittent Search Strategy
Bat Algorithm is Better Than Intermittent Search StrategyBat Algorithm is Better Than Intermittent Search Strategy
Bat Algorithm is Better Than Intermittent Search Strategy
 
Data science in chemical manufacturing
Data science in chemical manufacturingData science in chemical manufacturing
Data science in chemical manufacturing
 
Clustering and Classification in Support of Climatology to mine Weather Data ...
Clustering and Classification in Support of Climatology to mine Weather Data ...Clustering and Classification in Support of Climatology to mine Weather Data ...
Clustering and Classification in Support of Climatology to mine Weather Data ...
 
Ontology based clustering algorithms
Ontology based clustering algorithmsOntology based clustering algorithms
Ontology based clustering algorithms
 
[IJET-V2I3P21] Authors: Amit Kumar Dewangan, Akhilesh Kumar Shrivas, Prem Kumar
[IJET-V2I3P21] Authors: Amit Kumar Dewangan, Akhilesh Kumar Shrivas, Prem Kumar[IJET-V2I3P21] Authors: Amit Kumar Dewangan, Akhilesh Kumar Shrivas, Prem Kumar
[IJET-V2I3P21] Authors: Amit Kumar Dewangan, Akhilesh Kumar Shrivas, Prem Kumar
 
Trabajo de ingles (5)
Trabajo de ingles (5)Trabajo de ingles (5)
Trabajo de ingles (5)
 
Satellite imageclassification
Satellite imageclassificationSatellite imageclassification
Satellite imageclassification
 
Investigations of certain estimators for modeling panel data under violations...
Investigations of certain estimators for modeling panel data under violations...Investigations of certain estimators for modeling panel data under violations...
Investigations of certain estimators for modeling panel data under violations...
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATAA HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
A HYBRID LEARNING ALGORITHM IN AUTOMATED TEXT CATEGORIZATION OF LEGACY DATA
 

More from Vatsal Patel

“Gujarat Metro Rail Project”
“Gujarat Metro Rail Project”“Gujarat Metro Rail Project”
“Gujarat Metro Rail Project”Vatsal Patel
 
“NIRAV MODI & THE PUNJAB NATIONAL BANK FRAUD”
“NIRAV MODI & THE PUNJAB NATIONAL BANK FRAUD”“NIRAV MODI & THE PUNJAB NATIONAL BANK FRAUD”
“NIRAV MODI & THE PUNJAB NATIONAL BANK FRAUD”Vatsal Patel
 
“SARADHA GROUP (SG) FINANCIAL SCANDAL”
“SARADHA GROUP (SG) FINANCIAL SCANDAL”“SARADHA GROUP (SG) FINANCIAL SCANDAL”
“SARADHA GROUP (SG) FINANCIAL SCANDAL”Vatsal Patel
 
“India International Exchange IFSC GIFT city Gandhinagar”
“India International Exchange IFSC GIFT city Gandhinagar”“India International Exchange IFSC GIFT city Gandhinagar”
“India International Exchange IFSC GIFT city Gandhinagar”Vatsal Patel
 
“Commodity Treading & Future Option Maker”
“Commodity Treading & Future Option Maker”“Commodity Treading & Future Option Maker”
“Commodity Treading & Future Option Maker”Vatsal Patel
 
“A study on the awareness of microfinance institutions in Ahmedabad.”
“A study on the awareness of microfinance institutions in Ahmedabad.”“A study on the awareness of microfinance institutions in Ahmedabad.”
“A study on the awareness of microfinance institutions in Ahmedabad.”Vatsal Patel
 
“Role of Technology in Bank.”
“Role of Technology in Bank.”“Role of Technology in Bank.”
“Role of Technology in Bank.”Vatsal Patel
 
“A study on the Service quality of HDFC bank & SBI bank.”
“A study on the Service quality of HDFC bank & SBI bank.”“A study on the Service quality of HDFC bank & SBI bank.”
“A study on the Service quality of HDFC bank & SBI bank.”Vatsal Patel
 
A comprehensive report on strategy analysis of Colgate.
A comprehensive report on strategy analysis of Colgate.A comprehensive report on strategy analysis of Colgate.
A comprehensive report on strategy analysis of Colgate.Vatsal Patel
 
Tea industry analysis of India
Tea industry analysis of IndiaTea industry analysis of India
Tea industry analysis of IndiaVatsal Patel
 
“A study on customer feedback and upgradation of Haem up vet launched or intr...
“A study on customer feedback and upgradation of Haem up vet launched or intr...“A study on customer feedback and upgradation of Haem up vet launched or intr...
“A study on customer feedback and upgradation of Haem up vet launched or intr...Vatsal Patel
 

More from Vatsal Patel (12)

“Gujarat Metro Rail Project”
“Gujarat Metro Rail Project”“Gujarat Metro Rail Project”
“Gujarat Metro Rail Project”
 
“NIRAV MODI & THE PUNJAB NATIONAL BANK FRAUD”
“NIRAV MODI & THE PUNJAB NATIONAL BANK FRAUD”“NIRAV MODI & THE PUNJAB NATIONAL BANK FRAUD”
“NIRAV MODI & THE PUNJAB NATIONAL BANK FRAUD”
 
“SARADHA GROUP (SG) FINANCIAL SCANDAL”
“SARADHA GROUP (SG) FINANCIAL SCANDAL”“SARADHA GROUP (SG) FINANCIAL SCANDAL”
“SARADHA GROUP (SG) FINANCIAL SCANDAL”
 
“India International Exchange IFSC GIFT city Gandhinagar”
“India International Exchange IFSC GIFT city Gandhinagar”“India International Exchange IFSC GIFT city Gandhinagar”
“India International Exchange IFSC GIFT city Gandhinagar”
 
“Commodity Treading & Future Option Maker”
“Commodity Treading & Future Option Maker”“Commodity Treading & Future Option Maker”
“Commodity Treading & Future Option Maker”
 
“A study on the awareness of microfinance institutions in Ahmedabad.”
“A study on the awareness of microfinance institutions in Ahmedabad.”“A study on the awareness of microfinance institutions in Ahmedabad.”
“A study on the awareness of microfinance institutions in Ahmedabad.”
 
“Role of Technology in Bank.”
“Role of Technology in Bank.”“Role of Technology in Bank.”
“Role of Technology in Bank.”
 
“A study on the Service quality of HDFC bank & SBI bank.”
“A study on the Service quality of HDFC bank & SBI bank.”“A study on the Service quality of HDFC bank & SBI bank.”
“A study on the Service quality of HDFC bank & SBI bank.”
 
A comprehensive report on strategy analysis of Colgate.
A comprehensive report on strategy analysis of Colgate.A comprehensive report on strategy analysis of Colgate.
A comprehensive report on strategy analysis of Colgate.
 
Tea industry analysis of India
Tea industry analysis of IndiaTea industry analysis of India
Tea industry analysis of India
 
“A study on customer feedback and upgradation of Haem up vet launched or intr...
“A study on customer feedback and upgradation of Haem up vet launched or intr...“A study on customer feedback and upgradation of Haem up vet launched or intr...
“A study on customer feedback and upgradation of Haem up vet launched or intr...
 
carbon credit
carbon credit carbon credit
carbon credit
 

Recently uploaded

{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...ThinkInnovation
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 

Recently uploaded (20)

{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 

Perceuptal mapping, Factor analysis, cluster analysis, conjoint analysis

  • 1. N. R. Institute of Business Management (NRIBM-PGDM) Project On Perceuptal mapping, Factor analysis, cluster analysis, conjoint analysis Advance Marketing Research Submitted to Prof. Jaineel Shah Submitted by Term V Batch -2018-20 Sr No. Name Roll No. 1 Sachin Dubey P 1817 2 Jeena Patel P 1840 3 Vatsal Patel P 1846 4 Shail Rami P 1849 5 Viren Trivedi P 1868
  • 2. Red Bull Monster Combu Sting Cloud 9 Xtra Power Caffeine Sweetnes s PerceptualMapping Brands Sweetness (1-5) caffeine (1-5) Market Share Red Bull 2 4 20% Monster 5 4 15% Combu 5 1 5% Sting 4 2 10% Cloud 9 3 3 1% Xtra Power 1 4 17% Interpretation:  From above chart we can interpret that Red Bull covers maximum market even though being very bitter in taste with 20% market share, so it can be said market leader in beverage segment.  Followed by Market challengers like Monster, Sting and Xtra power with 15%, 10% and 17% respectively as they are good in taste and also have caffeine in it.  So we can say that people give more preference to caffeine factor then taste factor as 1 being very bad and 5 being very good
  • 3. FactorAnalysis Interpretation:  As we can see in the above KMO & Bartlett’s table that the test is 0.00 and 0.934 that means that the result of the test is positive and would be accepted.  So it means that the research done was accurate.  The other table shows that the cumulative % is 58.478 which is more than 50% so the test done is accepted and genuine in terms of relevance.
  • 4. Cluster Analysis Clustering analysis: A case study of the environmental data of RAMA-Toluca Recently, the climatic analysis has been widely studied with artificial intelligence tools. The importance of this topic is based on the environment impact produced for natural variations of the data on a certain ecosystem. In this paper, a first study of the meteorological parameters obtained with the Automatic Network of Atmospheric Monitoring (by its abbreviation in Spanish, RAMA) of Toluca, Mexico, is exposed. The study period is from 2001 to 2008. RAMA-Toluca includes seven monitoring stations located in the Toluca Valley. Using clustering algorithms, the experimental results establish the base for determining the days of distribution in clusters, which could be oriented to the natural cluster that the days have in climatic seasons. However, the results show a different situation than the awaited one. With this, the bases for future work are in the climatic analysis context in Toluca Valley. INTRODUCTION The environmental data analysis is one topic that, in the last decades, has had importance in the scientific community. In this scope, studying the climatic change is the main environmental problem. The impact of this change has been foreseeable on the hydric resources, the productive ecosystems, the bio-diversity, the infrastructure, the public health and generally, on the diverse components included in the development process (Staines, 2007), which threatens the healthy environment and the quality of life. In the Mexican State, particularly, in the metropolitan zone of the Toluca Valley (MZTV), it is possible to see that, when a rural place has been over-passed to an industrialized place, due to the continuous process of urbanization, the natural resources are devastated, and several environmental problems, like: bad use of the ground and reduction of the agricultural and forest border, invasion of protected natural areas, deforestation, erosion processes, forest fires, residues burnt in open- cast, pollution emissions by industries and damaged vehicles, are found. For this reason, several artificial intelligence (AI) techniques are proposed to discover and conduct patterns of climate parameters in the MZTV. The AI is a discipline for developing software and hardware which can emulate the human actions, for example, mani- pulation of knowledge, generating conclusions, explaining the human reasoning and conducting it as if it was a human. Clustering is the generic name of a great variety of techniques, useful for finding non- obvious knowledge in large data sets (Kotsiantis and Pintelas, 2004). There are two technique groups: The non-hierarchic techniques or the partition one and the hierarchic techniques. The first one separates the data set in k groups, and the second one forms a set of several differentiation levels (MacKay, 2003). We can find different useful methods for determining the quality of clusters (Bolshakova et al., 2005). These methods use numerical
  • 5. Several studies have been developed to handle this problem. Some studies of climatic change are: Secretaria del medio ambiente (2007) and Parra-Olea et al. (2005). In general, the proposals consider a regional study of climatic changes (Travasso et al., 2008), projecting the available global climate-model predictions regionally and identifying the effects of these changes (Gutierrez and Pons, 2006; Tebaldi and Knutti, 2009). On the other hand, several researchers use either data mining (Steinbach et al., 2002; Atem et al., 2004) or clustering methods in different ways: for example, for discovering ecosystem patterns (Steinbach et al., 2001; Kumar et al., 2001) and for improving algorithm behavior (Gutiérrez and Rodríguez, 2004), where proposals were made on weighted clustering methods for analyzing infrequent patterns, or extreme events, in weather forecasts.

Based on the work summarized above, the objective of this study was to analyze and discover the information inside the data bases provided by RAMA-Toluca. In particular, we analyzed the meteorological variables using clustering algorithms, to identify the grouping in each year of the studied period (2001-2008). That is to say, we can know the distribution of the days among the groups and, in consequence, the seasons identified by the clustering methods.

The paper is organized as follows: the clustering methods used in the study are presented, followed by a description of the cluster validation algorithms that allow a corroboration of group quality. Then the study zone and the meteorological parameters evaluated are described in detail, after which the experimental results are shown. Finally, the concluding remarks and open lines of study are given.

CLUSTERING METHODS

The clustering process consists of dividing the data set into groups of similar objects. To measure the similarity between objects we usually use different distance measures, which are described later in this work.

Adaptive algorithm

The adaptive algorithm (AA) is an incremental heuristic method which uses two parameters: a distance threshold for creating groups (t) and a fraction T which determines the total confidence (σ). The main function of the algorithm is to create groups based on t (weighted by σ); the first group, however, is set arbitrarily. The main steps of the AA are the following (Bow, 1992):
(i) The first group is determined arbitrarily.
(ii) When a sample is assigned to a group, the cluster center must be recalculated. This process can make some samples change cluster.
(iii) It is possible that the samples of a given cluster change due to the iterative process.
(iv) The algorithm ends when there are no reassignments. At this point, the partition is considered stable.
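The paper gives no pseudocode for the AA, so the following is a minimal sketch of one plausible reading of steps (i)-(iv), assuming Euclidean distance; the function name adaptive_clustering, and the role of tau (the fraction σ above) as a multiplier on the threshold t, are my assumptions rather than the authors' specification.

```python
# Incremental adaptive clustering: new groups open when a sample is farther
# than the weighted threshold from every existing center.
import numpy as np

def adaptive_clustering(X, t, tau, max_iters=100):
    centers = [X[0].copy()]               # (i) first group set arbitrarily
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iters):
        changed = False
        for i, x in enumerate(X):
            dists = [np.linalg.norm(x - c) for c in centers]
            nearest = int(np.argmin(dists))
            if dists[nearest] > t * tau:  # too far from all centers: new group
                centers.append(x.copy())
                nearest = len(centers) - 1
            if labels[i] != nearest:      # (iii) samples may change cluster
                labels[i] = nearest
                changed = True
        for k in range(len(centers)):     # (ii) recompute each cluster center
            members = X[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
        if not changed:                   # (iv) no reassignments: stable
            break
    return labels, np.array(centers)
```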
  • 6. K-means algorithm

K-means is a partition algorithm: similar samples end up in the same cluster and dissimilar samples in different clusters (MacKay, 2003). The algorithm needs a single parameter, k, which defines the number of groups to be found in the data set. K-means uses an iterative process which starts by defining a sample prototype (centroid) as the representative of each cluster, computed as the average of its samples. Next, each sample is assigned to the closest centroid using a metric, commonly the Euclidean distance. Then each centroid is recalculated from the newly formed group. This process continues until a stopping criterion is met, for example, a number of epochs or no more reassignments (Garre et al., 2007). The algorithm is fast and efficient; nevertheless, it has several limitations, such as requiring a priori knowledge of the number of clusters in the data set.

Validation algorithms

Cluster validation consists of evaluating the clustering result in order to find the partition that best fits the data (Halkidi et al., 2001). Once the conglomerates are created, their quality must be verified through validation algorithms (Bolshakova et al., 2005).

Cohesion

Cohesion can be defined as the sum of the proximities of a cluster's samples to its prototype (centroid) (Bolshakova et al., 2005). The cohesion is given by:

cohesion(Ci) = Σ x∈Ci proximity(x, ci)

where x is a sample contained in cluster Ci, ci is the centroid of cluster Ci, and proximity is the squared Euclidean distance.

Separation

The separation between two clusters can be measured by the proximity of their prototypes (centroids). It is given by the following equation:

separation(Ci) = proximity(ci, c)

where ci is the centroid of cluster Ci, c is the general prototype (overall centroid), and proximity can be any metric (Tan et al., 2006).
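A minimal sketch of K-means plus the two measures just defined, assuming scikit-learn for the clustering and squared Euclidean distance as the proximity function; the synthetic matrix X stands in for the real station data.

```python
# K-means clustering with cohesion and separation computed from the result.
import numpy as np
from sklearn.cluster import KMeans

def cohesion(X, labels, centers):
    # Sum of squared distances of each sample to its own cluster centroid.
    return sum(
        np.sum((X[labels == k] - c) ** 2) for k, c in enumerate(centers)
    )

def separation(centers, X):
    # Squared distance of each centroid to the overall prototype.
    overall = X.mean(axis=0)
    return np.sum((centers - overall) ** 2, axis=1)

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 6))
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cohesion:", cohesion(X, km.labels_, km.cluster_centers_))
print("separation per cluster:", separation(km.cluster_centers_, X))
```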
  • 7. Table 1. Number of samples per year in each station.

Station  2001  2002  2003  2004  2005  2006  2007  2008  Total
CE       8627  8746  8528  8567  8104  6910  5546  2661  57689
SL       8171  8639  8692  8549  8441  7973  4087  7761  62313
SM       8213  8480  8478  8535  7909  8316  7828   595  58354

Silhouette coefficient

This method combines the two previous measures, cohesion and separation. The following steps explain the computation of the coefficient for a single object (Halkidi et al., 2001):
i.) For the i-th object, compute the average distance to all other objects in the same cluster; call this value ai.
ii.) For the i-th object and each cluster not containing it, compute the average distance to all the objects in that cluster; the minimum of these values over all such clusters is called bi.
iii.) For the i-th object, the silhouette coefficient is si = (bi - ai) / max(ai, bi), where max(ai, bi) is the maximum of ai and bi.

The silhouette coefficient varies between -1 and 1, and reaches its maximum value of 1 when ai = 0. A negative value is undesirable, because it corresponds to the case where ai, the average distance to points in the same cluster, is greater than bi, the minimum average distance to points of the other clusters. The desired result is a positive coefficient (ai < bi) with ai close to 0. The silhouette average of one cluster is the average coefficient of all the points inside that cluster, and a general measurement of a clustering can be obtained by averaging the coefficient over all points (Tan et al., 2006).
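A minimal sketch of this validation step, assuming scikit-learn; the per-cluster average is the mean coefficient of that cluster's samples, and the overall score is the mean over all points.

```python
# Silhouette validation for several candidate values of k.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 6))
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    s = silhouette_samples(X, labels)            # per-object coefficients s_i
    per_cluster = [s[labels == c].mean() for c in range(k)]
    print(k, "overall:", silhouette_score(X, labels),
          "per cluster:", per_cluster)
```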
  • 8. Study zone

In the MZTV, air quality has been measured since 1993 with 7 monitoring stations, covering seven municipalities in the three zones shown in Figure 1. The monitoring stations store environmental data. For this research, the meteorological variables studied are: TMP (temperature), HR (relative humidity), PA (atmospheric pressure), RS (solar radiation), VV (wind speed) and DV (wind direction).

The data present several problems related to the monitoring stations which complicate their study, some of which are: faults in a sensor, or rough movements of a sensor provoked by the wind or other causes. Some meteorological values are inconsistent with reality (for example, a temperature of 80°C in winter), and some values are not captured at all (lost data). When the RAMA administrator identifies one of these problems, the record is marked for later consideration. The solution adopted here was to take, for each feature, the average of the previous and next real values, which yields a value within the realistic range. On the other hand, when a sample loses more than 50% of its information, it is considered noise and eliminated.

The data used for the study were provided by the 3 monitoring stations that showed the most desirable characteristics, namely a large number of records and little loss of information. These stations are: Toluca Center (CE), San Lorenzo Tepatitlan (SL) and San Mateo Atenco (SM). Table 1 shows the number of samples by station and the number of patterns per year in each station.

[Figure: a) CE station, b) SM station, c) SL station]

EXPERIMENTAL RESULTS

Here, the results of the clustering algorithms applied to the data base provided by RAMA-Toluca are shown. First the data were filtered, and then the clustering algorithms were applied: k-means and the adaptive algorithm. The specifications for the k-means algorithm are as follows: the initial seed was chosen randomly, and the values k = 2, 3, 4 were used for each data base (Table 1).
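A sketch of the two cleaning rules described above, assuming hourly readings in a pandas DataFrame with one column per meteorological variable; the file name is hypothetical.

```python
# Data cleaning: drop noisy samples, then fill flagged/missing values.
import pandas as pd

df = pd.read_csv("rama_toluca_CE.csv")  # hypothetical per-station export

# Drop any sample (row) missing more than 50% of its fields: treated as noise.
df = df.dropna(thresh=int(df.shape[1] * 0.5) + 1)

# Replace each remaining missing value by the average of the previous and next
# real observations, i.e. linear interpolation between its two neighbours.
df = df.interpolate(method="linear", limit_direction="both")
```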
  • 9. For the adaptive algorithm, different thresholds were applied per data base: 100-150, 150-200, 200-250, 250-300, 300-350, 350-400, 400-450 and 450-500, for the threshold and T-value respectively. Figure 2 shows how the samples were grouped and how many groups were formed. The quality of the conglomerates, validated with the silhouette coefficient, is displayed in Figure 3.

In Figures 2 and 3 it is possible to observe that as the number of clusters decreases, the quality increases. This indicates that the best clustering is obtained when the algorithm finds two clusters. On the other hand, Table 2 lists the samples grouped in each clustering result (for the case of two clusters). The figures reflect the convergence between the two clustering algorithms, as both settle on two groups. Regarding the group sizes, one of the groups is almost twice as large as the other.

[Figure panels: CE station, SM station, SL station]
  • 10. Conclusions

Throughout the year, the climate changes according to the climatic season. The hypothesis establishes that there are four seasons in one year; for this reason, we expected to find four conglomerates in the data set provided by RAMA-Toluca, owing to the similarities between the samples of each season. Nevertheless, the analysis presented here shows that, with the meteorological data analyzed, the clustering algorithms found only two large groups. To validate the quality of the clusters, the silhouette coefficient, cohesion and separation were used.

The preliminary results presented here could indicate that, in the meteorological data studied, the samples of each year share similar features belonging mainly to two seasons. In addition, given the insignificant differences between the conglomerates, it is possible to suppose that any climatic variation happened before the year 2001.
  • 11. With these results, it is possible to establish the bases for future work on this important topic, but several questions need answers: Is the number of clusters equal to the number of seasons? Is the behavior due to climatic change? To answer these questions and obtain a wider analysis, we contacted meteorological experts; we have been in touch with the Environmental Engineering group of the Technological Institute of Toluca and the Environment Secretariat in Toluca to improve the analysis, and we expect to present the new analysis in future work.

As research in progress, we are working with the unsupervised neural network SOM (Tan et al., 2006) to compare several scenarios, for example, 2001 against 2002 to 2008. A linear regression and correlation analysis is being carried out in another study in progress that will soon be finished. The open lines point to the study of other data bases with information from more years and from other states or countries. It is also possible to include the analysis of other years and other climatic data bases, as well as to use other algorithms such as ISODATA and DBSCAN (Martín et al., 1996). In the same way, we are analyzing the convenience of including other validation methods and of studying methods for handling missing data.
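A hedged sketch of the SOM comparison mentioned above, assuming the third-party minisom package; the grid size, training length and synthetic data are arbitrary illustrations, not the study's settings.

```python
# Self-organizing map: group days by their best-matching unit (BMU).
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(3)
X = rng.normal(size=(365, 6))          # stand-in for one year of daily data

som = MiniSom(8, 8, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, num_iteration=5000)

# Grouping days by BMU yields clusters that can be compared year against
# year (e.g. 2001 vs. 2002-2008).
bmus = [som.winner(x) for x in X]
```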
  • 12. Conjoint analysis

A Case Study of Behavior-driven Conjoint Analysis on Yahoo! Front Page Today Module

Since the advent of conjoint methods in marketing research pioneered by Green and Rao [9], research on theoretical methodologies and pragmatic issues has thrived. Conjoint analysis is one of the most popular marketing research methodologies for assessing users' preferences over various objective characteristics of products or services. Analysis of trade-offs, driven by heterogeneous preferences over the benefits derived from product attributes, provides critical input for many marketing decisions, e.g. optimal design of new products, target market selection and product pricing. It is also an analytical tool for predicting users' plausible reactions to new products or services.

In practice, a set of categorical or quantitative attributes is collected to represent products or services of interest, while a user's preference for a specific attribute is quantified by a utility function (also called a partworth function). While there exist several ways to specify a conjoint model, additive models that linearly sum up the individual partworth functions are the most popular choice. As a measurement technique for quantifying users' preferences on product attributes (or partworths), conjoint analysis always consists of a series of steps, including stimulus representation, feedback collection and estimation. Stimulus representation involves developing stimuli based on a number of salient attributes (hypothetical profiles or choice sets) and presenting the stimuli to appropriate respondents. Based on the nature of users' responses to the stimuli, popular conjoint analysis approaches are either choice-based or ratings-based. Recent developments in estimation methods comprise hierarchical Bayesian (HB) methods [15], polyhedral adaptive estimation [19], Support Vector Machines [2, 7], etc.

We summarize three main differences between Web-based conjoint analysis and the traditional kind in the following:
 The Web content may involve stimuli that contain many psychologically related attributes, rather than the predefined attributes of interest in a traditional experimental design. Meanwhile, most users are casual or new visitors who declare some or none of their personal information and interests. Since we have to extract attributes or discover latent features when profiling both content stimuli and users, parameter estimation becomes more challenging than in the traditional situation;
 In feedback collection, most respondents have no strong incentive to expend their cognitive resources on the prominent but unsolicited content. This issue causes a relatively high rate of false negative observations;
 The sample size considered in traditional conjoint analysis is usually less than a thousand, whereas it is common in modern e-business applications to observe millions of responses in a short time, e.g. in a few hours.
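To make the additive model concrete, here is a minimal ratings-based sketch, assuming dummy-coded attribute levels and scikit-learn; the profiles, ratings and attribute names are hypothetical, and a full study would fit one such model per respondent.

```python
# Additive conjoint model: a profile's utility is the sum of the partworths
# of its attribute levels, estimated here by ordinary least squares.
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical profiles: each row is a stimulus shown to one respondent.
profiles = pd.DataFrame({
    "brand":  ["A", "A", "B", "B"],
    "price":  ["low", "high", "low", "high"],
    "rating": [9, 6, 7, 3],          # respondent's stated preference
})

X = pd.get_dummies(profiles[["brand", "price"]], drop_first=True)
model = LinearRegression().fit(X, profiles["rating"])

# Fitted coefficients are the partworths relative to the dropped base levels.
partworths = dict(zip(X.columns, model.coef_))
print(partworths, "baseline utility:", model.intercept_)
```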
  • 13. The large-scale data sets make traditional conjoint analysis, coupled with sophisticated Monte Carlo simulation for parameter estimation, computationally prohibitive.

In this paper, we conduct a case study of conjoint analysis on a click-through stream to understand users' intentions. We construct features to represent the Web content and collect user information across the Yahoo! network. The partworth function is optimized in a tensor regression framework via gradient-descent methods on large-scale samples. In the partworth space, we apply clustering algorithms to identify meaningful segments with distinct behavior patterns. These segments result in significant CTR lift over both the unsegmented baseline and two demographic segmentation methods in offline and online tests on the Yahoo! Front Page Today Module application. Also, by analyzing the characteristics of user segments, we obtain interesting insights into users' intentions and behavior that could be applied to market campaigns and user targeting. This knowledge could further be utilized to help editors with content management.

PROBLEM SETTING

In this section, we first describe our problem domain and our motivation for this research. Then we describe our data set and define some notation.

2.1 Today Module

Today Module is the most prominent panel on the Yahoo! Front Page, which is also one of the most popular pages on the Internet; see the snapshot in Figure 1. The default "Featured" tab in Today Module highlights one of four high-quality articles selected from a daily-refreshed article pool curated by human editors. As illustrated in Figure 1, there are four articles at footer positions, indexed by F1, F2, F3 and F4 respectively. Each article is represented by a small picture and a title. One of the four articles is highlighted at the story position, featured by a large picture, a title and a short summary along with related links. By default, the article at F1 is highlighted at the story position. A user can click on the highlighted article at the story position to read more details if she is interested in the article; this event is recorded as a "story click". If a user is interested in an article at the F2-F4 positions, she can highlight it at the story position by clicking on its footer position.
  • 14. One of our goals is to increase user activity, measured by overall CTR, on the Today Module. To draw visitors' attention and increase the number of clicks, we would like to rank the available articles according to visitors' interests and highlight the most attractive article at the F1 position. In our previous research [1] we developed an Estimated Most Popular (EMP) algorithm, which estimates the CTR of available articles in near real time with a Kalman filter and presents the article with the highest estimated CTR at the F1 position. Note that there is no personalized service in that system, i.e. the article shown at F1 is the same for all visitors at a given time. In this work we would like to further boost overall CTR by launching a partially personalized service: user segments determined by conjoint analysis will be served different content according to segmental interests, with the articles of highest segmental CTR served to each segment respectively.

Data Collection

We collected three sets of data: content features, user profiles, and interaction data between users and articles. Each article is summarized by a set of features, such as topic categories, sub-topics, URL resources, etc. Each visitor is profiled by a set of attributes as well, e.g. age, gender, residential location, Yahoo! property usage, etc. Here we simply selected a set of informative attributes to represent users and articles. Gauch et al. [8] give an extensive review of various profiling techniques.

There are multiple treatments of users' reactions in modelling the partworth utility:
 Choice-based responses: we only consider whether an article has been clicked by a visitor, ignoring repeated views and clicks. In this case, an observed response is simply binary: click or not;
 Poisson-based responses: the number of clicks observed on each article/user pair is treated as a realization of a Poisson distribution;
 Metric-based responses: we consider repeated views and clicks and treat the CTR of each article by each user as the target.
In the Today Module setting, Poisson-based and metric-based responses might be vulnerable to the high rate of false negative observations. Thus we follow choice-based responses only in this work.

Notations

Let the i-th user be indexed as xi, a D × 1 vector of user features, and the j-th content item as zj, a C × 1 vector of article features. We denote by rij the interaction between the user xi and the item zj, where rij ∈ {−1, +1} for the "view" event and the "story click" event respectively. We only observe interactions on a small subset of all possible user/article pairs, and denote by O the set of observations {rij}.

TENSOR SEGMENTATION

In this section, we employ logistic tensor regression coupled with efficient gradient-descent methods to estimate the partworth function conjointly on large data sets. In the users' partworth space, we further apply clustering techniques to segment users. Note that we consider cases of millions of users and thousands of articles; the number of observed interactions between user/article pairs could be in the tens of millions.
  • 15. Tensor Indicator

We first define an indicator as a parametric function of the tensor product of the article features zj and the user attributes xi as follows:

sij = Σ(a=1..C) Σ(b=1..D) wab zj,a xi,b    (3)

where D and C are the dimensionality of user and content features respectively, zj,a denotes the a-th feature of zj, and xi,b denotes the b-th feature of xi. The weight variable wab is independent of user and content features, and represents the affinity of the two features xi,b and zj,a in interactions. In matrix form, eq(3) can be rewritten as

sij = xiᵀ W zj

where W denotes a D × C matrix with entries {wab}. The partworths of the user xi on article attributes are evaluated as Wᵀxi, denoted x̃i, a vector of the same length as zj. The tensor product above, also known as a bilinear model, can be regarded as a special case of the Tucker family [5], which has been extensively studied in the literature and in applications. For example, Tenenbaum and Freeman [18] developed a bilinear model for separating "style" and "content" in images, and recently Chu and Ghahramani [3] derived a probabilistic framework of the Tucker family for modelling structural dependency from partially observed high-dimensional array data.

Logistic Regression

Conventionally, the tensor indicator is related to an observed binary event by a logistic function. In our particular application, we found three additions to be needed:
 User-specific bias: users' activity levels are quite different; some are active clickers, while others are casual users. We introduce a bias term for each user, denoted μi for user i;
 Article-specific bias: articles have different popularity. We introduce a bias term for each article as well, denoted γj for article j;
 Global offset: since the number of click events is much smaller than the number of view events in our observations, the classification problem is heavily imbalanced. Thus we introduce a global offset term to take this situation into account.

Clustering

With the optimal coefficients W in hand, we compute the partworths for each training user as x̃i = Wᵀxi. The vector x̃i represents the user's preferences on article attributes. In the partworth space spanned by {x̃i}, we further apply a clustering technique, e.g. K-means [13], to group training users with similar preferences into segments. The number of clusters can be determined by validation in offline analysis. For an existing or new user, we can predict her partworths as x̃t = Wᵀxt, where xt is the vector of user features. Her segment membership is then determined by the shortest distance between the partworth vector and the centroids of the clusters, i.e.

m(xt) = argmin_k ||x̃t − ck||    (4)

where ck is the centroid of the k-th segment.
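A compact sketch of this estimation step, assuming plain stochastic gradient descent on the logistic loss over observations (i, j, rij); the function name, learning rate and epoch count are illustrative, since the paper states only that efficient gradient-descent methods are used. The toy usage at the end is likewise hypothetical.

```python
# Bilinear (tensor) logistic regression: s_ij = x_i^T W z_j + mu_i + gamma_j + b.
import numpy as np

def fit_bilinear_logistic(X_users, Z_items, obs, lr=0.01, epochs=50):
    """obs: list of (i, j, r) with r in {-1, +1} for view / story click."""
    D, C = X_users.shape[1], Z_items.shape[1]
    W = np.zeros((D, C))                # affinity of user/article features
    mu = np.zeros(X_users.shape[0])     # user-specific bias
    gamma = np.zeros(Z_items.shape[0])  # article-specific bias
    offset = 0.0                        # global offset for class imbalance
    for _ in range(epochs):
        for i, j, r in obs:
            s = X_users[i] @ W @ Z_items[j] + mu[i] + gamma[j] + offset
            g = -r / (1.0 + np.exp(r * s))   # d(log-loss)/d(s)
            W -= lr * g * np.outer(X_users[i], Z_items[j])
            mu[i] -= lr * g
            gamma[j] -= lr * g
            offset -= lr * g
    return W, mu, gamma, offset

# Toy usage: D = 4 user features, C = 3 article features.
rng = np.random.default_rng(0)
X_users, Z_items = rng.normal(size=(100, 4)), rng.normal(size=(20, 3))
obs = [(i, j, rng.choice([-1, 1])) for i in range(100) for j in range(3)]
W, mu, gamma, offset = fit_bilinear_logistic(X_users, Z_items, obs)
```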
  • 16. Offline Analysis

For each user in the test set, we first computed her segment membership, as in eq(4), and sorted all available articles in descending order of their CTR in the test user's segment at the time stamp of the event. On click events, we measured the rank position of the article clicked by the user. The performance metric used in offline analysis is the number of clicks in the top four rank positions. We varied the number of clusters from 1 to 20 and present the corresponding share of clicks at the top rank position in Figure 2. Note that CTR estimation within segments suffers from low-traffic issues when the number of segments is large. We observed the best validation performance at 8 clusters, but the difference compared with 5 clusters is not statistically significant. Thus we selected 5 clusters in our application.
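A minimal sketch of this evaluation step, assuming the fitted W and the cluster centroids from the previous sections; the function names and the shape of the segment-CTR table are my own illustrative choices.

```python
# Assign a test user to the nearest centroid in partworth space (eq 4),
# then rank articles by their estimated CTR within that segment.
import numpy as np

def assign_segment(x_user, W, centroids):
    partworth = W.T @ x_user                       # predicted partworths
    dists = np.linalg.norm(centroids - partworth, axis=1)
    return int(np.argmin(dists))

def rank_articles(segment, segment_ctr):
    """segment_ctr: (n_segments, n_articles) array of estimated CTRs."""
    return np.argsort(-segment_ctr[segment])       # descending CTR order

# On each click event, record where the clicked article landed in this
# ranking; the reported metric counts clicks in the top four positions.
```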
  • 17. Segment Analysis

We collected some characteristics of the 5 segments we discovered. On the September data, we identified the cluster membership of all users and plotted the population distribution over the 5 segments as a pie chart in Figure 4. The largest cluster holds 32% of users, while the smallest contains 10%.
 Cluster c1 consists mostly of female users under age 34;
 Cluster c2 consists mostly of male users under age 44;
 Cluster c3 is for female users above age 30;
 Cluster c4 consists mainly of male users above age 35;
 Cluster c5 is predominantly non-U.S. users.
We also observed that c1 and c2 contain a small portion of users above age 55, and c3 has some young female users as well. Cluster membership is thus not solely determined by demographic information, though demographics give a very strong signal; it is users' behavior that reveals their interest in article topics.

CONCLUSIONS

In this study, we executed conjoint analysis on a large-scale click-through stream of the Yahoo! Front Page Today Module. We validated the segments discovered by the conjoint analysis through offline and online tests. We analyzed the characteristics of users in the segments and also found different visiting patterns across segments. The insight into user intention at the segment level found in this study could be exploited to enhance user engagement on the Today Module by assisting editors with article content management. In this study, a user can belong to only one segment. We would like to explore other clustering techniques, such as Gaussian mixture models, which allow multiple memberships, so that a user's preference might be determined by a weighted sum of several segmental preferences. We plan to pursue this direction in the future.