MSDS - 453
Assignment #4
Demand Space Analysis Using NLP
Esteban Ribero
Summary
This report describes the exploration and application of methods used to
automate a Demand Space Analysis using search queries. The search queries
represent the demand for content that consumers of a product category, such as
makeup, seek from search engines and advertisers. Several approaches to
vectorize the search queries were explored, from traditional methods such as TFIDF to
pre-trained word embeddings, along with several topic modeling approaches, including
k-means clustering on the original matrices or on matrices with reduced
dimensionality, and the learnings are described. An Intent Ontology was also
used to classify search queries based on intent similarity rather than semantic similarity,
and the combination of both approaches is described. I conclude that, for best results, the
statistical low-level method for clustering search queries based on semantic
similarity should be crossed with the top-down approach that uses the Intent
Ontology to cluster search queries based on intent similarity.
Several visualizations and analyses were coded in a Python script to automate
most of the analysis and provide a reliable way to replicate it for other product
categories as needed.
Challenge
At Performics we manage paid (SEM) and organic (SEO) search programs for Fortune 500
companies. One of the key challenges we face is understanding the demand for content
sought out by consumers so we can design strategies to fulfill that demand, either with paid ads
or with organic content on the brands’ websites that is picked up organically by the search
engines. One of the analyses we do is a Demand Space Analysis, where we take a big
sample of search queries and group them by topic and by intent. We then append the search
volume for those queries to get a true view of the size of the demand (number of searches)
by topic and intent.
Most of this process is done manually, making it time-consuming and resource-intensive. This
report describes the methods and process designed to automate most of the
analysis. To illustrate the process, I will use a sample of search queries from the makeup
category. The following section describes the data.
Data
The dataset is composed of 9,360 unique search queries and their individual search volume
from November 2017 to October 2018. These search queries were collected using a proprietary
tool that extracts search queries from Google by seeding a term (e.g., makeup) in the Google
search bar and collecting the suggested search queries that other users tend to use when
searching for the seed term, then using those suggestions as the seeds in the next
round. After a couple of iterations, the system generates a sizable number of search queries
related to some degree to the original seed term. The search queries range from a single word
to 8 words, with most being 3 to 4 words long.
Characteristics of a Demand Space Analysis
The Demand Space Analysis is the classification of search queries along two separate but related
dimensions. One dimension is the semantic similarity of the search queries; the other
is their intent similarity. These dimensions are meant to be
different since they capture different aspects of the goals of searchers who use the queries to get
information from search engines and advertisers. Semantic similarity should capture the
topics being searched, while intent similarity should represent the goals the searcher is
looking to fulfill as he or she learns about a product category.
To cluster search queries based on intent similarity we use an Intent Ontology that I created for
the company. The following section describes the Ontology and the way we use it.
Clustering Based on Intent Similarity
Figure 1 shows the Intent Ontology. Starting from the bottom, we have the search queries, and we
infer the searcher’s intent (goal) depending on the words used in the search query.
Figure 1. Intent Ontology
Search queries may contain Questions, Products or Services, Descriptors, and Brands (ours,
competitors’, or retailers’). The search queries may also contain Buying signals such as “where to
buy”, “buy”, or “coupons”, or Navigational signals such as “my.brand.com” or “login” (these are
rare and not relevant to the makeup example category). Depending on the combination of such
terms we infer the Goal of the searcher: to Understand things about the category,
to Explore the category and its options, to Evaluate a particular brand or compare brands,
to Buy the product right away, or, if it is a current user, to find ways to better Enjoy the
products they already have.
The Goal (Intent) layer is the layer we care about the most since, to us, knowing what the
consumer wants is what makes an ad helpful and relevant rather than intrusive and annoying. Behind
each goal there is a Searcher with all kinds of attributes but, unless we are the search engine,
we can only infer whether the searcher is a current user of the brand or a prospective customer. To
personify the goals, we refer to them as the “Mindsets” that people move in and out of as they
explore the category and progress through the purchase journey. They represent distinct ways of
thinking and levels of abstraction used to construe the world.
The psychological theory behind this is called Construal Level, which refers to how humans
represent the world in their minds when they think about it. We can think of the world in very
abstract ways or very concrete ways, and this seems to vary depending on how close we are
to accomplishing a goal. When we are farther away from a goal, but are thinking about it,
we think in an abstract way and tend to ask questions or be concerned about the meaning of
things. When we get closer to our goals and are thinking about them, we think more concretely
and pay attention to specific aspects of reality. In the context of buying a product, this means
we pay attention to the specific features of a product, such as its brand, price, or place to buy.
The above ontology represents the way we have operationalized this thinking to make it
practical. We use a set of rules to decide where each search query goes. The rules are applied
consecutively and the first one that matches overrules the ones further down the line. For instance,
if a query contains a question term such as why, how, or meaning, it immediately goes to the
Understanding mindset no matter what other term appears in the query; this supersedes any other
rule. If the search query contains a navigational cue such as my or login, it goes to the Enjoying
mindset. If it contains a buying signal such as near me or buy (but not a navigational cue), it gets
assigned to the Buying mindset. If the search query contains a brand term (but not a buying signal),
it gets assigned to the Evaluating mindset. Sometimes we distinguish between our brands and others
to break this group into Evaluating Us vs Evaluating Others. Whatever remains gets classified
as Exploring. The search queries in the Exploring mindset typically contain terms referring to
products and services, such as mascara, foundation, or eyeshadow, and adjectives or descriptors
such as best, top, black, or waterproof. Since in certain categories most of the search queries may
fall into this bucket, we sometimes break it in two by distinguishing the search queries that
contain only a product or service (Exploring Broad) from those that contain additional
descriptors (Exploring Narrow).
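To make the cascade concrete, the sketch below shows one way the rules could be coded in Python. The term sets are illustrative placeholders, not the actual category lexicon, and the function name is hypothetical.

```python
# Minimal sketch of the intent rule cascade. The term lists are illustrative
# placeholders, not the full category lexicon, and the matching is simplified.
QUESTIONS = {"why", "how", "what", "meaning"}
NAVIGATIONAL = {"my", "login"}
BUYING = {"buy", "coupon", "coupons", "near me"}
BRANDS = {"sephora", "maybelline", "lancome"}

def classify_intent(query: str) -> str:
    """Apply the rules in order; the first rule that fires wins."""
    q = query.lower()
    tokens = set(q.split())
    if tokens & QUESTIONS:
        return "Understanding"
    if tokens & NAVIGATIONAL:
        return "Enjoying"
    if any(term in q for term in BUYING):   # phrase match for multi-word signals
        return "Buying"
    if tokens & BRANDS:
        return "Evaluating"
    return "Exploring"                      # everything else, split Broad/Narrow later

print(classify_intent("how to apply mascara"))   # Understanding
print(classify_intent("sephora near me"))        # Buying
```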
This classification has proved to be very useful. When we match the ad and the landing page or
website content to the Mindset of the searcher, with language that speaks to it, we have seen
significant improvements in the performance of the ad as measured by higher click-through rates
(the percent of people who click on an ad after seeing it).
Using the Ontology
The set of rules explained above has been coded into an algorithm that relies on a lexicon to
help the machine distinguish between the different types of terms. This lexicon is created
specifically for each category, and the different terms in the lexicon are essentially Equivalent
Classes for the different types of words. For instance, Table 1 contains the “question” terms,
the “buying signal” terms, and the list of “our brands” in the lexicon.
Notice that we use different spellings of the words as these variations appear often in search
queries. The full lexicon for this exercise contains 204 words. The largest Equivalent Classes are
the “Other brands” with 85 terms and “Product and Services” with 61 terms. The full lexicon is a
Reference Term Vector and is used as described above to classify the search queries by intent.
Table 1. Sample of Terms From the Intent Lexicon
Figure 2 shows the breakdown of the search volume by the more general Mindsets. Since
Enjoying is not relevant for this category, we omit it from the analysis.
Figure 2. Percent of Search Volume by Mindset
As can be seen, consumers approach the category mostly with an Evaluating mindset or an
Exploring mindset. These two mindsets represent 59.5% and 27.8% of the total search volume,
respectively. Understanding represents only 4.8% and Buying 7.9%. To split the two big blocks into
smaller groupings, I divided them into Evaluating Us and Evaluating Others, and Exploring Broad
and Exploring Narrow, per the description in the Intent Ontology. This is now done automatically
for every data set that is passed through the Python script containing the intent classification algorithm.
Figure 3. Percent of Search Volume by the Full Set of Mindsets
The full set of mindsets provides a more granular view of the demand for content. Evaluating
Others is still a very big segment, representing 49.5% of the search volume, while Evaluating Us
represents 10%. Exploring Narrow and Exploring Broad now represent 20.6% and 7.3%
respectively. On closer inspection, the large amount of search volume under Evaluating Others is
driven by the single term “Sephora”, which accounts for 39,850,000 searches per year and
represents 59.6% of the Evaluating Others mindset and 29.3% of the total search volume for
the entire category! That is impressive. This also illustrates a typical phenomenon in search,
where the majority of the volume is driven by a few keywords while a very long tail of
other keywords drives the rest of the category. For instance, “Sephora” is just 1 term among
9,360 but represents almost 1/3 of the entire demand for content for the makeup category.
This highlights the need to use search volume as a way to weight the importance of each
term in any clustering or classification system, and to avoid relying solely on word frequencies
within the corpus.
This initial Intent Segmentation already provides insight and guidance for strategy. The demand
for our brands, in this example L’Oréal brands, is only 10% of the demand, so relying only on
L’Oréal brands’ branded terms for search ads or search engine optimization tactics will miss
most of the opportunity to capture and convert demand for the category. While Evaluating
Others is a difficult place to play in, since consumers are looking specifically for information
about competitors, it is worth exploring opportunities for conquesting on specific branded terms
from competitors or, in the case of Sephora, making sure the brand is sold at that particular
retailer and that L’Oréal brands are clearly associated with Sephora. The next best place to search
for opportunities to increase presence, via either paid ads or organic search results, is the
Exploring mindset. Exploring Broad contains mostly terms such as “makeup brushes”, “eyeshadow
palette”, and “make up kit”. These are good terms to strategize on but also tend to be very
expensive, so the next best bucket is Exploring Narrow, which contains a much more varied set
of terms such as “best mascara”, “eye shadow looks”, “best foundation for oily skin”, and “makeup
for hooded eyes”.
A more detailed view of the different topics under each Mindset is warranted, and so a semantic
similarity clustering method crossed with the intent similarity method would be of great
use.
Clustering Based on Semantic Similarity
A deep exploration of Natural Language Processing methods was performed to arrive at the
optimal solution. The full exploration of methods is presented in the appendix. The following
section briefly describes the approaches and key learnings.
Method Exploration Summary
For the semantic similarity, I used two types of pre-processing methods to tokenize the search
queries: one with heavy pre-processing using stemming, removing punctuation and stop words,
the other with minimal preprocessing simply tokenizing the search query and lowercasing the
words. For vectorization of the search queries, I initially used TFIDF, Doc2Vec and Word2Vec
trained specifically on the corpus. For Word2Vec I averaged the vectors of each word in the
search query to get to a single vector per search query. I then used the k-means method to
cluster the search queries and compared approaches. I used the elbow method to identify the
optimal number of groups with each method and compared the results. Doc2Vec and averaging
Word2Vec vectors produced unsatisfactory results suggesting too few groups, and surprisingly,
the TFIDF method performed the best giving a much wider options of clustering solutions.
When clustering the terms into 10 segments I got a decent classification, since the terms in
clusters appeared to have similar meaning but the variation in the size of the clusters was a bit
concerning.
To explore other options for topic modeling, I tried two additional approaches with the
TFIDF matrix: a Latent Dirichlet Allocation with 10 topics, and a Latent Semantic
Analysis with Truncated SVD, reducing the dimensionality of the data to 30 topics. The results
from the LDA analysis were hard to interpret and the topics did not appear to be clearly
differentiated. The LSA was promising, but the 30 topics represented only 8.9% of the variance
in the data, and so these approaches were discarded.
These results suggested that the algorithms were having a hard time identifying differences
between the search queries, most likely due to the lack of context and richness in the text given
that search queries are short in nature. This gave me the idea that we could increase the
differences between search queries by borrowing contextual information from pre-trained
word embeddings.
Using Pre-Trained Word Embeddings for Knowledge Transfer
Transferring knowledge with pre-trained word embeddings is an ideal solution since the
embeddings capture the context and semantic meaning of words used in rich and
dense collections of text. I chose GloVe.6B.100d because of the nature of the corpus it was
trained on, which includes the entire Wikipedia corpus. This is preferable to the original Word2Vec
trained on Google News because it is more likely to contain terms from my corpus such as
product names like mascara, references to parts of the face such as eyelashes, or brand terms
such as Sephora. Wikipedia is, in fact, a common source of knowledge and references for
brands and products, so it seems ideal.
I used the GloVe pre-trained word embeddings to create search query vectors by averaging the
vectors of the words in each search query. Only 17 of the 9,360 queries in the corpus did not
contain any word included in the GloVe vocabulary. These search queries were eliminated from
further analysis, leaving an effective sample of 9,343 search queries (documents).
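A minimal sketch of this step, assuming the glove.6B.100d.txt file has been downloaded locally and that the full list of query strings is loaded elsewhere:

```python
import numpy as np

# Load the GloVe vectors into a dict. The file path is an assumption; the
# glove.6B.100d.txt file must be downloaded separately from the Stanford GloVe page.
embeddings = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

def query_vector(query):
    """Average the GloVe vectors of a query's words; None if no word is covered."""
    vecs = [embeddings[w] for w in query.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else None

# queries: the full list of 9,360 search query strings, loaded elsewhere
query_vecs = {q: query_vector(q) for q in queries}
kept = {q: v for q, v in query_vecs.items() if v is not None}   # drops the 17 uncovered queries
```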
Given the relative success of reducing the dimensionality of the data with Truncated SVD on the
TFIDF matrix (see appendix for more context), I reduced the dimensionality of the data from 100
(the length of the GloVe vectors) to 30 by performing a PCA and extracting the top 30
components. These components explain 78% of the variance, so the approach felt
appropriate.
To explore the optimal number of clusters k for the k-means exercise, the elbow
method was repeated with the reduced pre-trained GloVe embeddings. Figure 4 shows the
result. The method suggests that 4 to 8 groups would be ideal; I decided to continue with 6
groups. Figure 5 shows the results when visualizing with the t-SNE method and two
components.
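The reduction, clustering, and t-SNE projection described above could look roughly like the following scikit-learn sketch, assuming X is the matrix of averaged GloVe query vectors stacked from the step above (the exact parameters used in the production script are not shown in this report):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# X: the (9343, 100) matrix of averaged GloVe query vectors built as sketched above
pca = PCA(n_components=30)
X30 = pca.fit_transform(X)
print(pca.explained_variance_ratio_.sum())        # ~0.78 reported above

# Elbow search over candidate k (plotted as in Figure 4), then the 6-cluster solution
inertias = [KMeans(n_clusters=k, random_state=0).fit(X30).inertia_ for k in range(2, 15)]
labels = KMeans(n_clusters=6, random_state=0).fit_predict(X30)

# Two t-SNE components for the scatterplot in Figure 5
xy = TSNE(n_components=2, random_state=0).fit_transform(X30)
plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=5, cmap="tab10")
plt.show()
```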
Figure 4. Elbow method for k-means with PCA reduced GloVe vectors.
Figure 5. Clustering based on semantic similarity into 6 groups and using t-SNE
Cluster | Buying | Evaluating Others | Evaluating Us | Exploring Broad | Exploring Narrow | Understanding | Total
0 | 1.4 | 35.2 | 5.9 | 1.3 | 0.2 | 0.1 | 44.3
1 | 6.2 | 4.7 | 0.9 | – | 1.5 | 0 | 13.5
2 | 0 | 0.5 | 0.1 | 0.5 | 5.5 | 1.8 | 8.6
3 | 0.1 | 2.7 | 0.1 | 2.4 | 5.1 | 1.2 | 11.6
4 | 0 | 3.8 | 1.2 | 2.7 | 4.9 | 1.3 | 14
5 | 0.1 | 2.4 | 1.7 | 0.2 | 3.2 | 0.2 | 8
Total | 7.8 | 49.2 | 9.9 | 7.1 | 20.5 | 4.8 | 100
As can be seen in Figure 5, the search queries cluster decently into somewhat distinct
groups. The grouping is not perfect, but it shows a satisfactory level of coherence. As
stated above, the goal is not to use these groupings by themselves but to cross them with
the Mindset segmentation from the Intent Ontology. Table 2 shows the result of such crossing.
The values represent the search volume generated by the queries in each bucket, not the
number of search queries in each bucket. This weighted table is a better tool for identifying the
best way to create a set of final segments that combine intent similarity with semantic
similarity. These final segments are called Intent Segments since they contain clearly defined
topics with a specific intent behind them.
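A sketch of how such a crossing could be produced with pandas; the column names and the illustrative rows below are assumptions, not the actual data:

```python
import pandas as pd

# Illustrative frame; the real one has one row per search query with its assigned
# mindset, semantic cluster, and annual search volume (assumed column names).
df = pd.DataFrame({
    "mindset": ["Evaluating Others", "Buying", "Exploring Narrow"],
    "cluster": [0, 1, 3],
    "search_volume": [39850000, 120000, 5400],
})

cross = pd.crosstab(df["cluster"], df["mindset"],
                    values=df["search_volume"], aggfunc="sum",
                    margins=True, margins_name="Total").fillna(0)

# Express each bucket as a percent of total search volume, as in Table 2
print((100 * cross / df["search_volume"].sum()).round(1))
```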
Table 2. Intent similarity (Mindsets) and Semantic similarity (Clusters)
As can be seen from the Total column on the right, Cluster 0 is the biggest semantic cluster.
This is most likely because it contains most of the branded terms, such as Sephora, and other big drivers
of search volume. Clusters 2, 3, and 4 fall primarily under Exploring Narrow, indicating that these clusters
contain different types of descriptors. Cluster 1 falls mostly under Buying and Evaluating Others,
indicating that this cluster contains terms such as “promos” and “coupons” combined with branded terms,
probably “Sephora”. The small discrepancies in the size of each Mindset between Table 2 and Figure 3 are
due to the search queries that were eliminated during the knowledge transfer because they had no GloVe
word embeddings.
Defining Intent Segments
Up to this point, the process for performing the Demand Space Analysis would be fully automated. The
analyst would then look at a table like the one above to decide which final groupings to define. The
guidance would be to identify buckets that hold at least 4% of the volume and keep those as
unique intent segments. The buckets with less than 4% of the search volume should be merged within
each Mindset. For instance, using the table above, the analyst would determine that, under Buying,
Cluster 0 should be merged with Clusters 2, 3, 4, and 5, leaving only two segments under Buying.
Clusters 2 to 5 should be merged under Evaluating Others, Clusters 1 to 5 under Evaluating Us, all the
clusters under Exploring Broad, Clusters 0, 1, and 5 under Exploring Narrow, and, finally, all the
clusters under Understanding.
Figure 6 shows the search volume represented by the newly defined Intent Segments, and Figures 7 and
8 show the top bigrams and unigrams for the major intent segments, extracted using CountVectorizer
and weighted by the search volume of the search queries they appear in.
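A sketch of the volume-weighted term extraction; the helper top_terms below is hypothetical and would be run once per intent segment:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def top_terms(queries, volumes, ngram_range=(2, 2), n=10):
    """Top n-grams, weighted by the search volume of the queries they appear in."""
    cv = CountVectorizer(ngram_range=ngram_range)
    counts = cv.fit_transform(queries)             # (n_queries, n_terms) term counts
    weights = counts.T @ np.asarray(volumes)       # volume-weighted term totals
    terms = np.array(cv.get_feature_names_out())
    order = np.argsort(weights)[::-1][:n]
    return list(zip(terms[order], weights[order]))

# e.g. run per intent segment, assuming df holds query, intent_segment, search_volume:
# for name, seg in df.groupby("intent_segment"):
#     print(name, top_terms(seg["query"], seg["search_volume"]))
```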
Figure 6. Percent of Search Volume by Intent Segment
As can be seen, the volume is now spread more evenly among smaller groups, except for
Evaluating_Others_1, which is the one containing “Sephora” and therefore is almost impossible to split
further. Evaluating_Others_ also turned out to be a big segment. The Intent Segments whose names end
without a number are the ones that resulted from combining all the buckets within each Mindset that had
less than 4% of the volume. These segments could have been analyzed further but were left as is for the
more detailed inspection in Figures 7 and 8.
Figures 7 and 8 provide a glimpse of what the segments are about. Understanding_1 is about makeup
tutorials, how to draw eyes, and how to apply the different products. Exploring_Broad_1 is about
makeup brushes, eyeshadow palettes, makeup kits, contouring, eyebrow products, etc.
Exploring_Narrow_1 is about all the best products, with several mentions of skin type.
Exploring_Narrow_2 is about the more general makeup searches, like makeup look and best makeup.
Exploring_Narrow_3 is all about foundation: best foundation, foundation coverage, best drugstore
foundation. Evaluating_Others_1 is the big one where Sephora takes the lead. Notice that Sephora is not
in the bigrams for this intent segment in Figure 7, but it is by far the biggest unigram for the segment as
shown in Figure 8. bh cosmetics is also an important term in this segment. Mac cosmetics, elf cosmetics,
and estee lauder are key to Evaluating_Others_2. As mentioned before, Sephora is a retailer where a lot
of the L’Oréal brands are sold, so it is not really a competitor but a partner, while Mac Cosmetics, BH
Cosmetics, and Estee Lauder are direct competitors. In fact, checking the terms for Buying_1, it is easy to
see that most of the volume from this intent segment, which represents consumers who are ready
to buy, comes from “near me” searches for Sephora and Ulta stores, with some mentions of gifts and
Lancôme. Evaluating_Us_1 is mostly about mascara from L’Oréal, Maybelline, Lancôme, and NYX
Cosmetics, while Evaluating_Us_2 is about foundation, lipstick, and eyeliner from those same brands.
Figure 7. Top Bigrams by Intent Segment
Figure 8. Top Unigrams by Intent Segment
It is important to note that the same term can be found in different intent segments, so a
list of the top unigrams and bigrams across the entire corpus, irrespective of intent segment,
shows which terms should be considered in general to guide the SEM and SEO strategy.
Table 3 shows the top bigrams and unigrams across all segments. Not surprisingly, sephora is the
dominant term, followed by makeup, best, foundation, cosmetics (from bh cosmetics and mac
cosmetics primarily), near, palette, mascara, and maybelline. The top bigrams give a more
nuanced picture, where the volume from the search queries containing these terms is far
less concentrated in a few of them. For comparison, 91.8% of the search volume is
generated by search queries that contain the top 10 unigrams vs only 18.9% by those that
contain the top 10 bigrams. This means that making sure we have a strong presence and content
that speaks to the top 10 terms is key to being associated with the category.
Table 3. Top unigrams and bigrams across all segments
These top 10 terms appear in most of the search queries, which explains in part why certain
algorithms struggle to differentiate between them. Further evidence of the
interconnectedness of the search queries is the network created by
measuring the cosine similarity between each pair of queries and creating a link between them when
the cosine similarity is above 0.8. Figure 9 shows the graph for the demand space of the
makeup category. The links and the nodes are colored based on the communities or clusters
they belong to.
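A sketch of how such a graph could be built with scikit-learn and networkx, assuming X holds the query vectors and queries the matching list of strings from the earlier sketches (the actual layout and visualization were done in Gephi):

```python
import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

# X: (n_queries, d) query vectors and queries: matching list of strings,
# e.g. the averaged GloVe vectors built in the earlier sketch.
sim = cosine_similarity(X)
rows, cols = np.where(np.triu(sim, k=1) > 0.8)   # upper triangle only, threshold 0.8

G = nx.Graph()
G.add_nodes_from(queries)
G.add_edges_from((queries[i], queries[j]) for i, j in zip(rows, cols))

# nx.write_gexf(G, "makeup_demand_space.gexf")   # e.g. export for layout in Gephi
```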
Figure 9. Graph of the Demand Space for the Makeup Category
As can be seen, there are a few small patches of search queries that are separated from the
core. The great majority of the search queries are concentrated in the middle and
interconnected with one another. This is further evidence that it is hard to fully separate the
search queries into distinct buckets based on their semantic similarity alone, and it provides support
for the need to classify them by intent similarity as well as semantic similarity to make sense
of the data and provide practical uses for the analysis.
Dynamic Mindsets Over-Time
One key aspect of the Demand Space Analysis is the seasonality of the different Mindsets. Since
the Mindsets represent consumers’ intentions and not consumers per se, it is expected that they
will be dynamic and vary over time. To check this characteristic for the makeup category, I
plotted the search volume for each mindset, indexing the volume for each month to the
maximum across the entire period analyzed. This way each mindset can be compared on
the same scale and the patterns over time become more apparent. This is how Google Trends
works, and it is a useful way to plan media flighting and content production/release to better
match the need states and intentions of consumers in the marketplace. Figure 10 shows the volume
over time for the core Mindsets.
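A sketch of the indexing step with pandas; the frame and column names below are assumptions used for illustration:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative long-format frame; the real one covers Nov 2017 - Oct 2018 per query.
monthly = pd.DataFrame({
    "month": ["2017-11", "2017-12", "2018-01"] * 2,
    "mindset": ["Buying"] * 3 + ["Exploring"] * 3,
    "search_volume": [900, 1200, 400, 700, 750, 720],
})

pivot = monthly.pivot_table(index="month", columns="mindset",
                            values="search_volume", aggfunc="sum")

# Index each mindset to its own peak month (max = 100), as in Google Trends
indexed = 100 * pivot / pivot.max()
indexed.plot()
plt.show()
```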
Figure 10. Search Volume Over-Time by Mindset
As can be seen, the Buying mindset is highly seasonal, peaking during the holidays, when a lot of
gifting and shopping is happening. Exploring and Enjoying are more stable over time. This
suggests that brands should build the pool of potential customers by engaging consumers when they are
Exploring, and match the flighting of media and content that speaks to those Evaluating and
Buying to mirror the rise and decay in buying and evaluating behavior. To get a more granular
view of the Exploring and Evaluating mindsets, Figures 11 and 12 show the behavior over time for
all their sub-segments.
Figure 11. Search Volume Over-Time by Exploring Intent-Segments
There are important nuances for the Exploring_Broad_1 Intent Segment. It has a stronger peak
during the holidays than the other Exploring segments. Exploring_Narrow_1 and
Exploring_Narrow_3 peak in March, not during the holidays, indicating that there is higher
demand for content during spring in preparation for the summer. These segments are heavily
focused on foundation and mascara, so brands should keep this in mind for the content creation
calendar.
Figure 12. Search Volume Over-Time by Evaluating Intent-Segments
There are also important nuances in the behavior over time of the Evaluating intent segments.
Both segments of Evaluating Others show a similar pattern, with a very large difference in
search volume between the holidays and the rest of the year. Searches for retailers like
Sephora are highly seasonal, since consumers search for them with the clear intention to shop
and buy, and this behavior is reflected in these trends. Both segments of Evaluating Us are
more stable over time, with some differences during the second half of the year. More research
is needed to understand the potential drivers of these small yet noticeable differences.
Conclusion
The use of semantic similarity in combination with intent similarity appears to provide
important nuances and helps us better understand the demand for content sought out by
searchers as they approach a product category. The process shown here, using a
large sample of search queries from the makeup category, illustrates the value of the different
NLP methods for creating a pipeline of tasks and analyses that can help automate the analysis,
providing enormous value to the company in terms of resource utilization and time to market.
This type of analysis is usually done when prospecting a new client or when entering a new
category, and there is often a time constraint that makes it hard to do on a consistent
basis. The process outlined here, where we first apply the intent classification with the standard
but flexible Intent Ontology and a customized Reference Term Vector, followed by a semantic
similarity clustering using pre-trained word embeddings, has been coded into a Python script
and can be run with any similar data set from another category. The initial clusters suggested by
the system should guide the analyst in defining the final Intent Segments that best
represent the search volume for the category. Once that manual definition is made, the rest is
also completely automated. The system will generate the treemaps, the word clouds for each
intent segment, and the trend lines over time for the overall mindsets and their sub-segments.
There is still an opportunity to explore clustering algorithms other than k-means and
to better integrate the graph and network analysis into the core process. At this point the graph
was used only for visualization and illustration purposes, but it could become a major way to
explore the data interactively and extract deeper insights. Right now the process of creating the
adjacency matrix takes too much time (it took more than 5 hours to complete for the makeup
example) and the actual visualization of the graph was done in a separate environment using
Gephi. The computational resources needed to do this effectively, and to provide the
interactivity to change, on the fly, the similarity threshold that determines whether links are created or
dropped between search queries, and to identify and explore communities, are beyond the
simple laptops we carry in the office, so an exploration of cloud services to perform this task is
warranted.
As with many Machine Learning and AI systems, there is room for continuous improvement and
adjustment. I take the methods and approaches learned here as a starting point and look
forward to building more sophisticated NLP applications with this and other types of unstructured
data.
APPENDIX
Method Exploration
Pre-processing
To pre-process the data, I used two methods. The first tokenizes the search queries,
removes punctuation, lowercases all words, removes stop words, and finally reduces the
individual words to their stems using the PorterStemmer. The second uses minimal pre-processing,
simply tokenizing the search query and lowercasing the words. The first method
was used for all exercises using TFIDF, Doc2Vec, and Word2Vec vectorization,
where the word and document embeddings were learned specifically for the corpus using
Python’s Gensim package. The second method was used when loading pre-trained word
embeddings from GloVe, to maximize the number of words identified in the pre-trained
embedding vocabulary.
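A minimal sketch of the two pipelines, assuming NLTK and its stop word corpus are installed and downloaded:

```python
import string
from nltk.corpus import stopwords      # assumes nltk.download("stopwords") has been run
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))
punct = str.maketrans("", "", string.punctuation)

def heavy_preprocess(query):
    """Lowercase, strip punctuation, drop stop words, and stem each token."""
    tokens = query.lower().translate(punct).split()
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

def minimal_preprocess(query):
    """Lowercase and tokenize only, to maximize matches against GloVe."""
    return query.lower().split()

print(heavy_preprocess("Best waterproof mascaras for sensitive eyes!"))
print(minimal_preprocess("Best waterproof mascaras for sensitive eyes!"))
```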
Low-level statistical NLP methods
After pre-processing the data, a clustering exercise was done using the k-means method: first
with the TFIDF vectorization, then with Doc2Vec vectorization, and then with
Word2Vec document vectorization obtained by averaging the vectors of the words in each search query.
To get an idea of the ideal number of clusters for k-means, I used the elbow method.
Figure 1 shows the results of the three approaches.
Figure 1. Elbow method for number of clusters using K-means
As can be seen, Doc2Vec produces a vector representation that does not discriminate well
between search queries. The method basically suggests that there are only two groups of
distinct search queries. This is unsatisfactory, and therefore any additional exploration with
Doc2Vec was dropped. Averaging the learned word embeddings does a better job, but still
suggests just a few groups, either 3 or 4. The more straightforward TFIDF method gives a
much wider range of options, as there is no clear cut-off point between 1 and 20 groups.
This is preferable since the solution needs enough granularity to help strategists provide
specific content or keyword bid recommendations: having just a few groups is not
desirable, while having too many is impractical.
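The corpus-specific vectorization and elbow computation compared above could be sketched with Gensim and scikit-learn roughly as follows; the hyperparameters are assumptions, and tokenized is assumed to hold the pre-processed token lists for the full corpus:

```python
import numpy as np
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

# tokenized: token lists for the full corpus, e.g. [heavy_preprocess(q) for q in raw_queries]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=3, min_count=1, epochs=20)
d2v = Doc2Vec([TaggedDocument(t, [i]) for i, t in enumerate(tokenized)],
              vector_size=100, min_count=1, epochs=20)

def avg_w2v(tokens):
    """Average the learned word vectors of a query; zeros if nothing matches."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X_w2v = np.vstack([avg_w2v(t) for t in tokenized])
X_d2v = np.vstack([d2v.dv[i] for i in range(len(tokenized))])

def elbow(X, ks=range(2, 21)):
    """k-means inertia for each candidate k; plotted to pick the number of clusters."""
    return [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in ks]
```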
To further explore the solution with averaged learned word embeddings, I did a cluster
analysis with 4 groups, per the suggestion of the elbow method, and calculated the size of each
cluster. Table 1 shows the results. As can be seen, the method creates a big group with 40% of
the search queries, two smaller groups with between 24% and 28%, and a small group with 8%. This is
further evidence that learning the word embeddings from the corpus does not discriminate well
enough between the search queries and does not seem to be a good approach.
Table 1. Cluster and size for K-means with word2vec learned embeddings.
The rationale for learning the word embeddings from the corpus is that the corpus contains many brand
terms and specific product names that are likely to be unique to it. The drawback may
be that many brand terms appear in the same contexts, so their discriminating power is
reduced. There might be other factors driving this lack of variation, but any further
analysis with learned word embeddings is left for another time.
To further explore the option of using the TFIDF method to represent the search queries in the
corpus, I ran a cluster analysis with k=10 to get a baseline grouping to compare against. I used
1-grams, 2-grams, and 3-grams to extract terms and vectorize the search queries. Once the clusters were
created using the full TFIDF matrix, I extracted the top 10 terms per cluster to get an idea of what
the clusters are about. Table 2 shows the results. The clusters are sorted from left to right by
size, and the top terms from top to bottom.
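A sketch of this TFIDF clustering and centroid-based term extraction, assuming tokenized holds the pre-processed queries from the sketch above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# docs: the pre-processed search queries re-joined into strings
docs = [" ".join(tokens) for tokens in tokenized]

tfidf = TfidfVectorizer(ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(docs)

km = KMeans(n_clusters=10, random_state=0).fit(X_tfidf)
terms = tfidf.get_feature_names_out()

# Top 10 terms per cluster, ranked by the term weights of each centroid
for c, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:10]
    print(c, [terms[i] for i in top])
```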
Table 2. Cluster size and top 10 terms for the TFIDF method.
As can be seen, the method produces a big cluster with 34% of the search queries, followed by a
mid-size cluster with 18%, two more clusters with around 10% each, then 4 small clusters with
around 6% of the queries each, and finally two tiny clusters with 3% and 1%. This is not ideal, but it is a
good place to start. The big group (Cluster 1) seems to be about L’Oréal, Maybelline, eyebrow, and
concealer. Cluster 5 is about makeup, eye makeup, tutorials, and kits. Cluster 7 is about foundation, skin,
and shade. Cluster 2 is primarily about Lancôme and mascara. Cluster 4 is about lipstick. Cluster
9 is about eyeliner, eyebrow pencil, and the brand Urban Decay. Cluster 6 is about contour, contour
kits, contour palettes, and cream. Cluster 3 is about eyeshadow. Cluster 8 is about drugstore
makeup, and Cluster 0 is about Estee Lauder, the brand. It is important to note that the clusters are not
as independent as one may think, since there are many common terms among them, but they
do seem to capture some of the structural patterns and topics in the data. Further exploration
is described later in the report.
Topic modeling
To explore other options for topic modeling, I tried two additional approaches with the
TFIDF matrix: a Latent Dirichlet Allocation with 10 topics and a Latent Semantic
Analysis with Truncated SVD.
Latent Dirichlet Allocation
Table 3 shows the topic vectors with the terms and weights for each of the 10 topics. The topics
are somewhat intelligible but not clear or distinct enough to be useful for our goal.
More exploration is needed to understand the role (if any) that LDA could play in the solution.
For now, we stop the exploration with LDA.
Table 3. LDA topics with TFIDF matrix
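A sketch of the 10-topic LDA fit, reusing X_tfidf and terms from the TFIDF sketch above. Note that scikit-learn's LDA is more commonly fit on raw term counts; fitting it on the TFIDF matrix here simply mirrors what the report describes:

```python
from sklearn.decomposition import LatentDirichletAllocation

# X_tfidf and terms come from the TFIDF sketch above
lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(X_tfidf)

# Top 10 terms per topic, as listed in Table 3
for t, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:10]
    print(t, [terms[i] for i in top])
```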
Latent Semantic Analysis
The use of LSA was different. I used Truncated SVD to reduce the TFIDF matrix to 30 topics and
then used t-SNE to visualize the data. One important note is that the 30 topics explain only 8.9%
of the variation in the data, so this may not be enough to be of practical use. I used t-SNE to
further reduce the dimensionality to 2 components and visualized the data with a scatterplot.
Figure 2 shows the results, with the 10 clusters created earlier from the full TFIDF matrix layered
in for analysis.
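A sketch of the Truncated SVD reduction and t-SNE projection, reusing X_tfidf and the fitted k-means model km from the TFIDF sketch above:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

svd = TruncatedSVD(n_components=30, random_state=0)
X_lsa = svd.fit_transform(X_tfidf)                 # 30 LSA "topics" per query
print(svd.explained_variance_ratio_.sum())         # ~0.089 reported in the text

xy = TSNE(n_components=2, random_state=0).fit_transform(X_lsa)
plt.scatter(xy[:, 0], xy[:, 1], c=km.labels_, s=5, cmap="tab10")  # colored by the 10 TFIDF clusters
plt.show()
```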
Figure 2. t-SNE 2 main components and Clusters using 30 LSA topics extracted from the TFIDF
matrix
Surprisingly, the approach seems to work well despite the huge loss of information from the
dimensionality reduction. There are still serious problems with Cluster 1, Cluster 7, and Cluster
3. There appear to be two buckets of search queries within Cluster 7 and within Cluster 3. However, the
biggest problem is with Cluster 1 (the big one, with 34% of the search queries), where the search
queries are scattered around in several small groups, except for a big group on the bottom left
of the graph.
It is impractical to inspect each of the 10 clusters in detail when each cluster is composed of
thousands of search queries. So instead of manual inspection, I decided to layer in the grouping
that we have used in the past to classify these search queries based on consumers’ intent. This
classification is the critical layer of the Intent Ontology that serves as the guiding principle
moving forward and is discussed in detail in its own section of the report. Figure
3 compares the previous t-SNE visualization by cluster with the manual coding of the search
queries based on the Intent Ontology.
Figure 3. Topic clusters and Intent classification
There is not much correspondence between the two types of classification, so no clear insights
can be drawn from the exercise, other than noting that the two classifications seem to be
capturing very different aspects of the search queries, which is not surprising. It is important to
note that the Intent classification is composed of 6 groups while the cluster analysis done so far
is composed of 10 groups, so no direct correspondence is expected, but hopefully
there is some type of relationship between the topic clusters and the intent clusters in the final
solution.
Knowledge (context) transfer with pre-trained word embeddings
One of the problems with search queries is that they don’t have a lot of context. Many times,
the search query is composed of just a few keywords like “best makeup” or “best mascara
drugstore” instead of a full sentence or question. There are two challenges here: 1) there is
little context from which to draw deeper semantic meaning, and 2) the language is not really “natural”;
it is the way people have been trained to search by the search engines over the past 15 years.
Given these challenges, I thought a good solution could be to transfer some of the missing
context and semantic meaning by using pre-trained vectors. This is an ideal solution since the
embeddings capture the context and semantic meaning of words used in rich and
dense collections of text. I chose GloVe.6B.100d because of the nature of the corpus it was
trained on, which includes the entire Wikipedia corpus. This is preferable to the original Word2Vec
trained on Google News because it is more likely to contain terms from my corpus such as
product names like mascara, references to parts of the face such as eyelashes, or brand terms
such as Sephora. Wikipedia is, in fact, a common source of knowledge and references for
brands and products, so it seems ideal. The 100-dimension vectors were chosen for simplicity, but
future exploration with other dimensions or with other pre-trained word embeddings like
FastText is desirable to optimize the solution. Availability of multiple languages is key,
since this solution would ideally be used at different Performics offices around the
globe. We are a global company with a network of offices around the world that rely on this
type of search query analysis.
I used the GloVe pre-trained word embeddings to create search query vectors by averaging the
vectors of the words in each search query. Only 17 of the 9,360 queries in the corpus did not
contain any word included in the GloVe vocabulary. These search queries were eliminated from
further analysis, leaving an effective sample of 9,343 search queries (documents).
The same k-means clustering exercise performed before was repeated with the new pre-trained
vectors. Figure 4 shows the result of the elbow method for identifying an optimal number
of clusters. Although there is no clear cut-off in the graph, the method suggests 5 as the
optimal number of clusters. However, for ease of comparison with the method using the TFIDF
matrix, I decided to go with 10 clusters for now.
Figure 4. Elbow Method for Identifying Optimal Number of Clusters with Glove Pre-Trained
Vectors.
To make a direct comparison with the TFIDF method used earlier, I extracted 1-gram and 2-gram
terms for each cluster using the CountVectorizer and counted the frequency of terms in
each cluster. Table 4 shows the top 10 terms in each cluster. This is not exactly the same as
with TFIDF, because these individual terms are not the ones used by the clustering algorithm;
they are sorted not by their proximity to the cluster centroids but by their frequency within the
cluster. Hopefully, these terms still represent the topic of each cluster.
For completeness of the comparison, I estimated the size of each cluster and sorted the clusters from
left to right.
Table 4. Top terms and size of clusters using pre-trained Glove vectors.
As can be seen, this method produces more balanced clusters: the first 4 contain between 12%
and 17% of the search queries, the next 5 between 7% and 10%, and only the last
cluster (Cluster 0) contains less than 5% of the search queries. This is a much more desirable solution.
However, these sizes should not be taken too seriously, since the real size of these segments
depends on the volume of searches that each query gets in a time period, which can
dramatically change the size of the segments. For instance, the top term Sephora by itself
drove 39% of the search volume from the entire corpus in November 2017. This is a single term
accounting for a large share of the demand for content in the makeup category as represented by
our corpus, so wherever that term falls will dramatically increase the size of that cluster.
For now, the variability in the size of the clusters in Table 4 is an indication of how well the
method discriminates between the search queries, and it indicates that it is doing so reasonably
well.
In terms of the meaning of each cluster, the single terms do capture some of it, but it is not
completely clear. Cluster 1 seems to be about how to apply makeup (an important topic
discovered during the manual classification). Cluster 8 is about makeup tutorials. Clusters 3 and
4 seem to be about foundation, but Cluster 3 is more about skin type while Cluster 4
is about reviews for specific brands of foundation. Cluster 6 is about eyeshadow palettes and
lipstick, and Cluster 0 seems to be about reviews. This is both good and bad news. The clustering
solution appears to capture nuances and differences between topics, like the ones speculated
for Clusters 3 and 4, but it also shows how difficult it is to understand what each cluster is
really about: there are so many individual search queries in each one, and so many share key terms,
that it is hard to grasp the meaning of a cluster without a labor-intensive check of all the
search queries in the group. For now, the focus is less on interpretability and more on the ability
of the method to capture the inner structure of the data and provide a good starting point. The next
couple of charts show precisely that.
Figure 5 repeats the analysis of using t-SNE to visualize the data by extracting the two main
components and layering in the clusters as well as the intent classification. To make a clearer
comparison with the TFIDF plus Truncated SVD method, I reduced the
dimensionality of the data from 100 (the length of the GloVe vectors) to 30 by performing a PCA
and extracting the top 30 components. These components explain 78% of the variance,
much better than the 8.9% from the Truncated SVD approach!
Figure 5. Topic clusters and Intent classification with Glove pre-trained vectors and PCA-30
As can be seen, the search queries are now more tightly grouped by cluster and occupy
distinct areas of the graph. There is some overlap between Clusters 8, 9, and 4 in the middle and
between 5 and 7 at the top, but the clusters can easily be identified in different regions. Please
note that the clustering was done with the full 100-dimensional vectors, not the reduced 30
principal components used to feed the t-SNE algorithm, so this could be refined further if the
clustering were done on the reduced data set. Something similar occurred with the intent
classification (ontology). Most of the “Evaluating” group is in the top right, while “Exploring
Narrow” is in the bottom left. This distinction was not guaranteed, so the GloVe embeddings are
certainly capturing semantic similarity as well as intent similarity. This is even more evident
with the “Understanding” group, whose search queries are mostly grouped together in the
bottom-right corner. This is encouraging. There is now more overlap and correspondence
between the topic clusters and the intent groupings; for instance, Cluster 6 seems to fall within
the “Understanding” group. Again, the hope is not to replace the Intent grouping with what
seems to be a topic (semantic) grouping, but to add another classification underneath it. Some
correlation is expected between topics and intent, but one does not replace the other.

Demand Space Analysis with NLP

  • 1.
    MSDS - 453 DemandSpace Analysis Using NLP Esteban Ribero
  • 2.
    MSDS - 453 Assignment#4 Esteban Ribero Summary This report describes the exploration and application of methods used to automate a Demand Space Analysis using search queries. The search queries represent the demand for content seek out from search engines and advertisers by consumers of a product category such as makeup. Several approaches to vectorize the search queries, from traditional methods such as TFIDF to pre- trained word embeddings, as well as several topic modeling approaches including k-means clustering using the original matrices or matrices with reduced dimensionality were used and learnings described. An Intent ontology was also used to classify search queries based on intent similarity, not semantic similarity, and the combination of both approaches is described. I conclude that the statistical low-level method for clustering search queries based on semantic similarity should be crossed with the top-down approach using the intent ontology to cluster search queries based on intent similarity for best results. Several visualizations and analysis were coded in a Python script to automate most of the analysis and provide a reliable way to replicate this for other product categories as needed it. Challenge At Performics we manage paid (SEM) and organic search (SEO) programs for fortune 500 companies. One of the key challenges we face is to understand what the demand is for content seek out by consumers so we can design strategies to fulfill that demand, either with paid ads or with organic content on the brands’ websites to be picked up organically by the search engines. One of the analyses that we do is a Demand Space Analysis where we take a big sample of search queries and group them by topic and by intent. We then append the search volume for those queries to get a true view of the size of the demand (number of search queries) by topic and intent. Most of this process is done manually making it time-consuming and resource-intensive. This report describes the methods and process that were designed to automate most of the analysis. To illustrate the process I will use a sample of search queries from the makeup category. The following section describes the data.
  • 3.
    Data The dataset iscomposed of 9,360 unique search queries and their individual search volume from November 2017 to October 2018. These search queries are collected using a proprietary tool that extracts search queries from Google by seeding a term (i.e. makeup) in the Google search bar and collecting the suggested search queries that other users tend to use when searching for the seed term, then using the suggested search queries as the seeds in the next round. After a couple of iterations, the system generates a sizable amount of search queries related in some degree to the original seed term. The search queries are composed of as few as 1 single word or up to 8 words, with most being of size 3 to 4. Characteristics of a Demand Space Analysis The Demand Space Analysis is the classification of search queries on separate but related dimensions. One dimension is the semantic similarity of the search queries, the other dimension is the intent similarity of the search queries. These dimensions are meant to be different since they capture different aspects of the goals of searchers who employ them to get information from search engines and advertisers. The semantic similarity should capture the topics being searched while the intent similarity should represent the goals that the searcher is looking to fulfill as he or she learns about a product category. To cluster search queries based on intent similarity we use an Intent Ontology that I created for the company. The following section describes the Ontology and the way we use it. Clustering Based on Intent Similarity Figure 1 shows the Intent Ontology. From the bottom-up, we have the search queries and we infer the searcher’s intent (goal) depending on the use of words in the search query.
  • 4.
    Figure 1. IntentOntology Search queries may contain Questions, Products or Services, Descriptors, and Brands (ours, competitors or retailers). The search queries may also contain Buying signals such as “where to buy”, “buy”, or “coupons” or Navigational signals such a “my.brand.com” or “login” (these are rare and not relevant to the makeup example category). Depending on the combination of such terms we infer the Goal of the searcher: To Understand things about the category, to Explore the category and its options, to Evaluate a particular brand or compare brands, to Buy the product right away, or if it is a current user, to find ways to better Enjoy the products they already have. The Goal (Intent) layer is the layer that we care the most since, to us, knowing what the consumer wants is what makes an ad helpful and relevant vs intrusive and annoying. Behind each goal there is a Searcher with all kinds of attributes but, unless we are the search engine, we can only infer if the searcher is a current user of the brand or a prospective customer. To personify the goal, we refer to those as the “Mindsets” that people get in and out of as they explore the category and move through the purchase journey. They represent distinct ways of thinking and levels of abstraction to construe the world.
  • 5.
    The psychological theorybehind this is called Construal Level, which refers to how humans represent the world in their minds when they think about it. We can think of the word in very abstract ways or very concrete ways and this seems to vary depending on how close we are from accomplishing a goal. When we are farther away from a goal, but we are thinking about it, we think in an abstract way and tend to use questions or be concerned about the meaning of things. When we get closer to our goals and are thinking about them, we think more concretely and pay attention to specific aspects of reality. In the context of buying a product, this means we pay attention to the specific features or a product, such as its brand, price, or place to buy. The above ontology represents the way we have operationalized this thinking to make it practical. We use a set of rules to decide where the search query goes. The rules are applied consecutively and the first one overrules the next in the line. For instance, if it contains a question term such as why, how or meaning, it immediately goes to the Understanding mindset no matter what other term appears in the query. This supersedes any other rule. If the search query contains a navigational cue such as my or login it goes to the Enjoying mindset. If it contains a buying signal such as near me or buy (but not a navigational cue) it gets assigned to the Buying mindset. If the search query contains a brand term (but not a buying signal) it gets assigned to the Evaluating mindset. Sometimes we distinguish between our brands and others to break this group into Evaluating Us vs Evaluating Others. Whatever remains get classified as Exploring. The search queries in the exploring mindset typically contain terms referring to products and services such as mascara, foundation eyeshadow, and adjective or descriptors such as best, top, black, waterproof. Since in certain categories most of the search queries may fall into this bucket, we sometimes break it into two by distinguishing the search queries that only contain the product and service (Exploring Broad) from those that contain additional descriptors (Exploring Narrow). This classification has proved to be very useful. When we match the ad and landing page or website content to the Mindset of the searcher with language that speaks to it, we have seen significant improvements in the performance of the ad as measured by higher click-thru-rates (the percent of people that click on an ad after seeing it). Using the Ontology The set of rules explained above has been coded into an algorithm that relies on a lexicon to help the machine distinguish between the different types of terms. This lexicon is created specifically for each category and the different terms in the lexicon are basically Equivalent Classes for the different types of words. For instance, the table 1 contains the “question” terms, the “buying signals” terms, and the list of “our brands” in the lexicon.
  • 6.
    Notice that weuse different spellings of the words as these variations appear often in search queries. The full lexicon for this exercise contains 204 words. The largest Equivalent Classes are the “Other brands” with 85 terms and “Product and Services” with 61 terms. The full lexicon is a Reference Term Vector and is used as described above to classify the search queries by intent. Table 1. Sample of Terms From the Intent Lexicon Figure 2 shows the breakdown of the search volume by the more general Mindsets. Since Enjoying is not relevant for this category we omit it for the analysis. Figure 2. Percent of Search Volume by Mindset
  • 7.
    As can beseen, consumers approach the category mostly with an Evaluating mindset or an Exploring mindset. These two mindsets represent 59.5% and 27.8% of the total search volume. Understanding represents only 4.8% and Buying 7.9%. To split the two big blocks into smaller groupings, I divided them into Evaluating Us and Evaluating Others, and Exploring Broad Narrow per the description in the Intent Ontology. This is now done automatically for every data set that is passed though the python script containing the intent classification algorithm. Figure 3. Percent of Search Volume by the Full Set of Mindsets The full set of mindsets provide a more granular view of the demand for content. Evaluating Others is still a very big segment representing 49.5% of the search volume, while Evaluating Us represents 10%. Exploring Narrow and Exploring Broad represent now 20.6% and 7.3% respectively. On closer inspection, the great amount of search volume under Exploring Others is driven a by the single term “Sephora” that accounts for 39,850,000 searches per year which represents 59.6% of the Evaluating Others mindset and 29.3% of the Total Search Volume for the entire category! That is impressive. This also illustrates a typical phenomenon in search and where the majority of the volume is driven by a few keywords but there is a very long tail of other keywords driving the rest for the category. For instance “Sephora” is just is 1 term among 9,360 but represents almost 1/3 of the entire demand for content for the makeup category. This highlights the need to use the search volume as a way to weight the importance of each term in any clustering or classification system and avoid relying solely on word frequencies within the corpus. This initial Intent Segmentation already provides insight and guidance for strategy. The demand for our brands, in this example L’Oréal brands, is only 10% of the demand, so relying only on L’Oréal brands’ branded terms for search ads or search engine optimization tactics will miss most of the opportunity to capture and convert demand for the category. While Evaluating Others is a difficult place to play in, since consumers are looking specifically about information for competitors it is worth exploring opportunities for conquesting on specific branded terms from competitors or in the case of Sephora, making sure the brand is sold in that particular retailer and L’Oréal brands are clearly associated with Sephora. The next best place to search
  • 8.
    for opportunities toincrease presence via either paid ads or organic search results is in the Exploring mindset. Exploring Broad are mostly terms such as “makeup brushes”, “eyeshadow palette”, “make up kit”, etc. These are good terms to strategize on but also tend to be very expensive, so the next best bucket is in the Exploring Narrow. Which contain a much varied set of terms such as “best mascara”, “eye shadow looks”, “best foundation for oily skin”, “makeup for hooded eyes”, etc. A more detailed view of the different topics under each Mindset is warranted and so a semantic similarity clustering method that is crossed with the intent similarity method would be of great use. Clustering Based on Semantic Similarity A deep exploration of methods of Natural Language Processing was performed to arrive at the optimal solution. The full exploration of methods is presented in the appendix. The following section briefly describe the approaches and key learnings. Method Exploration Summary For the semantic similarity, I used two types of pre-processing methods to tokenize the search queries: one with heavy pre-processing using stemming, removing punctuation and stop words, the other with minimal preprocessing simply tokenizing the search query and lowercasing the words. For vectorization of the search queries, I initially used TFIDF, Doc2Vec and Word2Vec trained specifically on the corpus. For Word2Vec I averaged the vectors of each word in the search query to get to a single vector per search query. I then used the k-means method to cluster the search queries and compared approaches. I used the elbow method to identify the optimal number of groups with each method and compared the results. Doc2Vec and averaging Word2Vec vectors produced unsatisfactory results suggesting too few groups, and surprisingly, the TFIDF method performed the best giving a much wider options of clustering solutions. When clustering the terms into 10 segments I got a decent classification, since the terms in clusters appeared to have similar meaning but the variation in the size of the clusters was a bit concerning. To explore other options for topic modeling I performed two additional approaches with the TFIDF matrix method. A Latent Dirichlet Allocation with 10 topics and a Latent Semantic Analysis with Truncated SVD by reducing the dimensionality of the data to 30 topics. The results from the LDA analysis were hard to interpret and the topics did not appear to be clearly differentiated. The LSA was promising but the 30 topics only represented 8.9% of the variance in the data as so these approaches were discarded.
These results suggested that the algorithms were having a hard time identifying differences between the search queries, most likely due to the lack of context and richness in the text, given that search queries are short by nature. This gave me the idea that we could increase the differences between search queries by borrowing contextual information from pre-trained word embeddings.

Using Pre-Trained Word Embeddings for Knowledge Transfer
Transferring knowledge with pre-trained word embeddings is an ideal solution since the embeddings capture the context and semantic meaning of the words as used in rich and dense collections of text. I chose GloVe.6B.100d because of the nature of the corpus it was trained on, which is the entire Wikipedia corpus. This is preferable to the original Word2Vec trained on Google News because it is more likely to contain terms from my corpus such as product names like mascara, references to parts of the face such as eyelashes, or brand terms such as Sephora. Wikipedia is, in fact, a common source of knowledge and references for brands and products, so it seems ideal. I used the GloVe pre-trained word embeddings to create search query vectors by averaging the vectors of the words in each search query. Only 17 search queries out of the 9,360 queries in the entire corpus did not contain any word included in the GloVe lexicon. These search queries were eliminated from further analysis, providing an effective sample of 9,343 search queries (documents).

Given the relative success of reducing the dimensionality of the data with Truncated SVD on the TFIDF matrix (see the appendix for more context), I reduced the dimensionality of the data from 100 (the length of the GloVe vectors) to 30 by performing a PCA analysis and extracting the top 30 components. These components explain 78% of the variance, so the approach felt appropriate. The elbow-method exploration of the optimal k for k-means clustering was repeated with the reduced pre-trained GloVe embeddings. Figure 4 shows the result: the method suggests that 4 to 8 groups would be ideal. I decided to continue with 6 groups. Figure 5 shows the results when visualizing with t-SNE using two components.
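Before turning to the figures, here is a minimal sketch of this knowledge-transfer step, assuming the glove.6B.100d.txt file is available locally; the queries are placeholders and the 30-component reduction is only meaningful on the full set of query vectors:

```python
# Minimal sketch: average GloVe word vectors per query, then reduce with PCA.
import numpy as np
from sklearn.decomposition import PCA

def load_glove(path="glove.6B.100d.txt"):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

glove = load_glove()

def query_vector(query):
    words = [w for w in query.lower().split() if w in glove]
    if not words:
        return None                      # queries with no in-vocabulary words are dropped
    return np.mean([glove[w] for w in words], axis=0)

queries = ["best mascara", "sephora near me", "makeup for hooded eyes"]
vectors = np.vstack([v for v in (query_vector(q) for q in queries) if v is not None])

# 30 components in the full analysis; capped here because of the tiny sample
pca = PCA(n_components=min(30, vectors.shape[0]))
reduced = pca.fit_transform(vectors)
print(pca.explained_variance_ratio_.sum())
```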
Figure 4. Elbow method for k-means with PCA-reduced GloVe vectors
Figure 5. Clustering based on semantic similarity into 6 groups, visualized with t-SNE
As can be seen in Figure 5, the search queries cluster decently into somewhat distinct groups. The grouping is not perfect, but it shows a satisfactory level of coherence. As stated above, the goal is not to use these groupings by themselves but to cross them with the Mindset segmentation from the Intent Ontology. Table 1 shows the result of such a crossing. The values represent the search volume generated by the queries in each bucket, not the number of search queries in each bucket. This weighted table is a better tool for identifying the best way to create a set of final segments that combine intent similarity with semantic similarity. These final segments are called Intent Segments since they contain clearly defined topics with a specific intent behind them.

Table 1. Intent similarity (Mindsets) and semantic similarity (Clusters), percent of search volume

Cluster | Buying | Evaluating Others | Evaluating Us | Exploring Broad | Exploring Narrow | Understanding | Total
0       | 1.4    | 35.2              | 5.9           | 1.3             | 0.2              | 0.1           | 44.3
1       | 6.2    | 4.7               | 0.9           | 0               | 1.5              | 0             | 13.5
2       | 0      | 0.5               | 0.1           | 0.5             | 5.5              | 1.8           | 8.6
3       | 0.1    | 2.7               | 0.1           | 2.4             | 5.1              | 1.2           | 11.6
4       | 0      | 3.8               | 1.2           | 2.7             | 4.9              | 1.3           | 14
5       | 0.1    | 2.4               | 1.7           | 0.2             | 3.2              | 0.2           | 8
Total   | 7.8    | 49.2              | 9.9           | 7.1             | 20.5             | 4.8           | 100

As can be seen from the Total column on the right, Cluster 0 is the biggest semantic cluster. This is most likely because it contains most of the branded terms, such as Sephora and other big drivers of search volume. Clusters 2, 3 and 4 fall primarily under Exploring Narrow, indicating that these clusters contain different types of descriptors. Cluster 1 falls mostly under Buying and Evaluating Others, indicating that it contains terms such as “promos” and “coupons” combined with branded terms, probably “Sephora”. The small discrepancies in the size of each Mindset between Table 1 and Figure 3 are due to the search queries that were eliminated during the knowledge transfer because none of their words had GloVe embeddings.

Defining Intent Segments
Up to this point, the process for performing the Demand Space Analysis would be fully automated. The analyst would then look at a table like the one above to decide which final groupings to define. The guidance is to identify buckets with at least 4% of the volume and keep those as unique Intent Segments; buckets with less than 4% of the search volume should be merged within each Mindset. For instance, using the table above, under Buying, Clusters 0, 2, 3, 4 and 5 should be merged, leaving only two segments under Buying. Clusters 2 to 5 should be merged under Evaluating Others. Clusters 1 to 5 should be merged under Evaluating Us. All the clusters under Exploring Broad should be merged. Clusters 0, 1 and 5 should be merged under Exploring Narrow. And finally, all the clusters under Understanding should be merged.
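As a reference, the sketch below shows how a crossing like Table 1 can be produced with pandas, weighting each cell by search volume rather than query counts; the column names and the few example rows are assumptions for illustration only:

```python
# Minimal sketch: cross semantic clusters with Mindsets, weighted by search volume.
import pandas as pd

df = pd.DataFrame({
    "query":   ["sephora", "best mascara", "makeup brushes"],     # placeholder rows
    "cluster": [0, 2, 3],
    "mindset": ["Evaluating_Others", "Exploring_Narrow", "Exploring_Broad"],
    "volume":  [39_850_000, 250_000, 180_000],
})

crossing = pd.pivot_table(df, index="cluster", columns="mindset",
                          values="volume", aggfunc="sum", fill_value=0)

# Express each cell as a percentage of total search volume, as in Table 1
print((crossing / df["volume"].sum() * 100).round(1))
```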
Figure 6 shows the search volume represented by the newly defined Intent Segments, and Figures 7 and 8 show the top bigrams and unigrams for the major intent segments, extracted using CountVectorizer and weighted by the search volume of the search queries they appear in.

Figure 6. Percent of Search Volume by Intent Segment

As can be seen, the volume is now spread more evenly among smaller groups, except for Evaluating_Others_1, which is the one containing “Sephora” and is therefore almost impossible to split further. Evaluating_Others_ (the segment with a trailing underscore and no number) also turned out to be a big segment. The Intent Segments whose names end in an underscore without a number are the ones that resulted from combining the buckets within each Mindset that had less than 4% of the volume. This segment could have been analyzed further but was left alone in favor of the more detailed inspection in Figures 7 and 8.

Figures 7 and 8 provide a glimpse of what the segments are about. Understanding_1 is about makeup tutorials, how to draw eyes and how to apply the different products. Exploring_Broad_1 is about makeup brushes, eyeshadow palettes, makeup kits, contouring, eyebrow products, etc. Exploring_Narrow_1 is about all the best products, with several mentions of skin type. Exploring_Narrow_2 is about the more general makeup searches, like makeup look and best makeup. Exploring_Narrow_3 is all about foundation: best foundation, foundation coverage, best drugstore foundation. Evaluating_Others_1 is the big one where Sephora takes the lead. Notice that Sephora is not among the bigrams for this intent segment in Figure 7, but it is by far the biggest unigram for the segment, as shown in Figure 8. BH Cosmetics is also an important term in this segment. MAC Cosmetics, e.l.f. Cosmetics and Estee Lauder are key to Evaluating_Others_2. As mentioned before, Sephora is a retailer where a lot of the L’Oréal brands are sold, so it is not really a competitor but a partner, while MAC Cosmetics, BH Cosmetics and Estee Lauder are direct competitors. In fact, checking the terms for Buying_1, it is easy to see that most of the volume from this intent segment, which represents consumers who are ready to buy, comes from “near me” store searches for Sephora and Ulta, with some mentions of gifts and Lancôme. Evaluating_Us_1 is mostly about mascara from L’Oréal, Maybelline, Lancôme and NYX Cosmetics, while Evaluating_Us_2 is about foundation, lipstick, eyeliner and those same brands.
Figure 7. Top Bigrams by Intent Segment
Figure 8. Top Unigrams by Intent Segment
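A minimal sketch of how these volume-weighted top terms can be extracted with CountVectorizer, counting each query's volume once per term it contains; the queries and volumes below are placeholders:

```python
# Minimal sketch: top unigrams/bigrams weighted by the volume of queries containing them.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

queries = ["best mascara", "best foundation for oily skin", "mascara review"]
volumes = np.array([50_000, 30_000, 10_000])

vec = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vec.fit_transform(queries)                  # query-by-term indicator matrix
term_volume = volumes @ X.toarray()             # total volume of queries containing each term

terms = vec.get_feature_names_out()
for i in np.argsort(term_volume)[::-1][:10]:
    print(terms[i], int(term_volume[i]))
```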
It is important to note that the same term can be found in different intent segments, so a list of the top unigrams and bigrams across the entire corpus, irrespective of Intent Segment, shows which terms should be considered in general to guide the SEM and SEO strategy. Table 2 shows the top bigrams and unigrams across all segments. Not surprisingly, sephora is the dominant term, followed by makeup, best, foundation, cosmetics (from bh cosmetics and mac cosmetics primarily), near, palette, mascara and maybelline. The top bigrams give a more nuanced picture, where the volume from the search queries in which these terms show up is far less concentrated in a few terms. For comparison, 91.8% of the search volume is generated by search queries that contain the top 10 unigrams versus only 18.9% by those that contain the top 10 bigrams. This means that having a strong presence and content that speaks to the top 10 terms is key to being associated with the category.

Table 2. Top unigrams and bigrams across all segments

These top 10 terms appear in most of the search queries, which explains in part why certain algorithms struggle to differentiate between the search queries. Further evidence of the interconnectedness of the search queries is the network created by measuring the cosine similarity between each pair of queries and creating a link between two queries when their cosine similarity is above 0.8. Figure 9 shows the graph of the demand space for the makeup category.
The links and the nodes are colored based on the communities or clusters they belong to.

Figure 9. Graph of the Demand Space for the Makeup Category

As can be seen, there are a few small patches of search queries that are separated from the core, but the great majority of the search queries are concentrated in the middle and interconnected with one another. This is further evidence that it is hard to fully separate the search queries into distinct buckets based on semantic similarity alone, and it supports the need to classify them by intent similarity as well as semantic similarity to make sense of the data and make the analysis practically useful.
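For illustration, a minimal sketch of how such a graph can be built with scikit-learn and networkx, linking two queries when the cosine similarity of their vectors exceeds 0.8; random vectors stand in here for the real query vectors:

```python
# Minimal sketch: similarity graph with links where cosine similarity > 0.8.
import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(42)
query_vectors = rng.random((100, 30))           # placeholder for the reduced GloVe vectors

similarity = cosine_similarity(query_vectors)
rows, cols = np.where(np.triu(similarity, k=1) > 0.8)   # upper triangle: each pair once

G = nx.Graph()
G.add_nodes_from(range(len(query_vectors)))
G.add_edges_from(zip(rows.tolist(), cols.tolist()))
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```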
Dynamic Mindsets Over Time
One key aspect of the Demand Space Analysis is the seasonality of the different Mindsets. Since the Mindsets represent consumers’ intentions and not consumers per se, they are expected to be dynamic and vary over time. To check this characteristic for the makeup category, I plotted the search volume for each mindset, indexing the volume for each month to the maximum across the entire time period analyzed. This way each mindset can be compared on the same scale and the patterns over time become more apparent. This is how Google Trends works, and it is a useful way to plan media flighting and content production/release to better match the need states and intentions of consumers in the marketplace. Figure 10 shows the volume over time for the core Mindsets.

Figure 10. Search Volume Over Time by Mindset

As can be seen, the Buying mindset is highly seasonal, peaking during the holidays when a lot of gifting and shopping is happening. Exploring and Understanding are more stable over time. This suggests that brands should build a pool of prospects by engaging consumers when they are Exploring, and match the flighting of media and content that speaks to those Evaluating and Buying to mirror the rise and decay in evaluating and buying behavior. To get a more granular view of the Exploring and Evaluating mindsets, Figures 11 and 12 show the behavior over time of all their sub-segments.
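As an aside, a minimal sketch of the max-indexing used for these trend charts, assuming a monthly table with month, mindset and volume columns (the values are made up):

```python
# Minimal sketch: index each mindset's monthly volume to its own peak (peak = 100).
import pandas as pd

monthly = pd.DataFrame({
    "month":   ["2017-11", "2017-12", "2018-01"] * 2,
    "mindset": ["Buying"] * 3 + ["Exploring"] * 3,
    "volume":  [900_000, 1_200_000, 400_000, 700_000, 750_000, 720_000],
})

pivot = monthly.pivot(index="month", columns="mindset", values="volume")
print((pivot / pivot.max() * 100).round(1))     # each column divided by its maximum
```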
Figure 11. Search Volume Over Time by Exploring Intent-Segments

There are important nuances for the Exploring_Broad_1 Intent Segment: it has a stronger holiday peak than the other Exploring segments. Exploring_Narrow_1 and Exploring_Narrow_3 peak in March rather than during the holidays, indicating higher demand for content during spring in preparation for the summer. These segments are heavily focused on foundation and mascara, so brands should keep this in mind for the content creation calendar.

Figure 12. Search Volume Over Time by Evaluating Intent-Segments

There are also important nuances in the behavior over time of the Evaluating intent segments. Both segments of Evaluating Others show a similar pattern, with a very large difference in search volume between the holidays and the rest of the year. Searches for retailers like Sephora are highly seasonal, since consumers search for them with the clear intention to shop and buy, and this behavior is reflected in these trends. Both segments of Evaluating Us are more stable over time, with some differences during the second half of the year. More research is needed to understand the potential drivers of these small yet noticeable differences.

Conclusion
The use of semantic similarity in combination with intent similarity provides important nuances and helps us better understand the demand for content sought out by searchers as they approach a product category. The process shown here, using a large sample of search queries from the makeup category, illustrates the value of different NLP methods in creating a pipeline of tasks and analyses that automates most of the work, providing substantial value to the company in terms of resource utilization and time to market. This type of analysis is usually done when prospecting a new client or when entering a new category.
There is often a time constraint that makes this analysis hard to do on a consistent basis. The process outlined here, where we first apply the intent classification with the standard but flexible Intent Ontology using a customized Reference Term Vector, followed by a semantic similarity clustering using pre-trained word embeddings, has been coded into a Python script and can be run with any similar data set from another category. The initial clusters suggested by the system should guide the analyst in defining the final Intent Segments that best represent the search volume for the category. Once that manual definition is made, the rest is also completely automated: the system generates the treemaps, the word clouds for each intent segment and the trend lines over time for the overall mindsets and their sub-segments.

There is still an opportunity to explore clustering algorithms other than k-means and to better integrate the graph and network analysis into the core process. At this point the graph was used only for visualization and illustration purposes, but it could become a major way to explore the data interactively and extract deeper insights. Right now the process of creating the adjacency matrix takes too much time (it took more than 5 hours to complete for the makeup example) and the actual visualization of the graph was done in a separate environment using Gephi. The computational resources needed to do this effectively, to provide interactivity to change the similarity threshold that creates or drops links between search queries on the fly, and to identify and explore communities are beyond the simple laptops we carry in the office, so an exploration of cloud services for this task is warranted. As with many Machine Learning and AI systems, there is room for continuous improvement and adjustment, so I take the methods and approaches learned here as a starting point and look forward to building more sophisticated NLP applications with this and other types of unstructured data.
APPENDIX
Method Exploration

Pre-processing
To pre-process the data, I used two methods. One method tokenizes the search queries, removes punctuation, lowercases all words, removes stop words and finally reduces the individual words to their stems using the PorterStemmer. The other method uses minimal pre-processing, simply tokenizing the search query and lowercasing the words. The first method was used for all exercises using TFIDF vectorization and for the Doc2Vec and Word2Vec vectorizations, where the word and document embeddings were learned specifically for the corpus using Python’s Gensim package. The second method was used when loading pre-trained word embeddings from GloVe, to maximize the number of words identified in the pre-trained word embedding lexicon.
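A minimal sketch of the two routines, assuming NLTK's punkt and stopwords resources have already been downloaded:

```python
# Minimal sketch of the heavy and minimal pre-processing routines.
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def heavy_preprocess(query):
    tokens = word_tokenize(query.lower())
    tokens = [t for t in tokens if t not in string.punctuation and t not in stop_words]
    return [stemmer.stem(t) for t in tokens]

def minimal_preprocess(query):
    return word_tokenize(query.lower())

print(heavy_preprocess("Best foundation for oily skin!"))
print(minimal_preprocess("Best foundation for oily skin!"))
```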
Low-level statistical NLP methods
After pre-processing the data, a clustering exercise was done using the k-means method, first with the TFIDF vectorization, then with Doc2Vec vectorization, and then with Word2Vec document vectorization obtained by averaging the vectors of the words in each search query. To get an idea of the ideal number of clusters for k-means, I used the elbow method. Figure 1 shows the results of the three approaches.

Figure 1. Elbow method for number of clusters using k-means

As can be seen, Doc2Vec produces a vector representation that does not discriminate well between search queries. The method is basically suggesting that there are only two groups of distinct search queries. This is unsatisfactory, so any additional exploration with Doc2Vec was eliminated. Averaging the learned word embeddings does a better job, but still suggests just a few groups, either 3 or 4. Using the more straightforward TFIDF method gives a much wider range of options, as there is not a clear cut at any point between 1 and 20 groups. This is preferable since the solution needs enough granularity to help strategists make specific content or keyword bid recommendations, so having just a few groups is not desirable, while having too many is impractical.

To further explore the solution with averaged learned word embeddings, I did a cluster analysis with 4 groups, per the suggestion of the elbow method, and calculated the size of each cluster. Table 1 shows the results. As can be seen, the method creates a big group with 40% of the search queries, two smaller groups with between 24% and 28%, and a small group with 8%. This is further evidence that learning the word embeddings from the corpus does not discriminate enough between the search queries and does not seem to be a good approach.

Table 1. Cluster sizes for k-means with Word2Vec learned embeddings

The rationale for learning word embeddings from the corpus is that it contains a lot of brand terms and specific product names that are likely to be unique to it. The drawback may be that many brand terms appear in the same contexts, which reduces their discriminating power. There might be other factors driving this lack of variation, but any further analysis with learned word embeddings is left for another time.

To further explore the option of using the TFIDF method to represent the search queries in the corpus, I ran a cluster analysis with k=10 to get a baseline grouping to compare against. I used 1-grams, 2-grams and 3-grams to extract terms and vectorize the search queries. Once the clusters were created using the full TFIDF matrix, I extracted the top 10 terms per cluster to get an idea of what each cluster is about. Table 2 shows the results. The clusters are sorted from left to right by size and the top terms from top to bottom.

Table 2. Cluster size and top 10 terms for the TFIDF method
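A minimal sketch of how the per-cluster top terms in Table 2 can be pulled from the k-means centroids of the TFIDF matrix; the queries (and the small k) are placeholders:

```python
# Minimal sketch: top terms per cluster, sorted by centroid weight in the TFIDF space.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

queries = ["best mascara", "loreal mascara", "eyeshadow palette",
           "makeup tutorial", "contour kit", "sephora near me"]

tfidf = TfidfVectorizer(ngram_range=(1, 3))
X = tfidf.fit_transform(queries)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

terms = tfidf.get_feature_names_out()
order = km.cluster_centers_.argsort()[:, ::-1]          # highest-weight terms first
for c in range(km.n_clusters):
    print(c, [terms[i] for i in order[c, :10]])
```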
As can be seen, the method produces a big cluster with 34% of the search queries, followed by a mid-size cluster with 18%, two more clusters each with around 10%, four small clusters with around 6% each, and finally two tiny clusters with 3% and 1% of the queries. This is not ideal, but it is a good place to start. The big group (Cluster 1) seems to be about L’Oréal, Maybelline, eyebrow, and concealer. Cluster 5 is about makeup, eye makeup, tutorial, and kits. Cluster 7 is about foundation, skin, and shade. Cluster 2 is primarily about Lancôme and mascara. Cluster 4 is about lipstick. Cluster 9 is about eyeliner, eyebrow pencil, and Urban Decay, the brand. Cluster 6 is about contour, contour kits, contour palette, and cream. Cluster 3 is about eyeshadow. Cluster 8 is about drugstore makeup, and Cluster 0 is about Estee Lauder, the brand. It is important to note that the clusters are not as independent as one may think, since there are a lot of common terms among them, but they do seem to capture some of the structural patterns and topics in the data. Further exploration is described later in the report.

Topic modeling
To explore other options for topic modeling I performed two additional approaches with the TFIDF matrix: a Latent Dirichlet Allocation with 10 topics and a Latent Semantic Analysis with Truncated SVD.

Latent Dirichlet Allocation
Table 3 shows the topic vectors with the terms and weights for each of the 10 topics. The topics are somewhat intelligible but not clear or distinct enough to be useful for our goal. More exploration is needed to understand the role (if any) that LDA could play in the solution. For now, the exploration with LDA stops here.

Table 3. LDA topics with TFIDF matrix
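For reference, a minimal sketch of an LDA run with scikit-learn; note that this sketch fits LDA on raw term counts rather than the TFIDF matrix used in the report, and the queries are placeholders:

```python
# Minimal sketch: LDA with 10 topics on a small count matrix.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

queries = ["best mascara", "loreal mascara review", "eyeshadow palette",
           "makeup tutorial for beginners", "contour kit", "sephora near me"]

counts = CountVectorizer()
X = counts.fit_transform(queries)

lda = LatentDirichletAllocation(n_components=10, random_state=42).fit(X)
terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    print(f"Topic {k}:", [terms[i] for i in topic.argsort()[::-1][:5]])
```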
Latent Semantic Analysis
The use of LSA was different. I used Truncated SVD to reduce the dimensionality of the TFIDF matrix to a matrix of 30 topics and then used t-SNE to visualize the data. One important note is that the 30 topics explain only 8.9% of the variation in the data, so this may not be enough to be of practical use. I used t-SNE to further reduce the dimensionality to 2 components and visualized the data with a scatterplot. Figure 2 shows the results, with the 10 clusters created earlier from the full TFIDF matrix layered in for analysis.

Figure 2. t-SNE 2 main components and clusters using 30 LSA topics extracted from the TFIDF matrix

Surprisingly, the approach seems to work well despite the huge loss of information from the dimensionality reduction. There are still serious problems with Cluster 1, Cluster 7, and Cluster 3. There appear to be 2 buckets of search queries within Cluster 7 and Cluster 3. However, the biggest problem is with Cluster 1 (the big one, with 34% of the search queries), where the search queries are scattered around in several small groups, except for a big group on the bottom left of the graph.
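A minimal sketch of this LSA step: Truncated SVD reduces the TFIDF matrix (to 30 topics in the full analysis, fewer here because of the tiny placeholder corpus) and t-SNE projects the result down to 2 components for plotting:

```python
# Minimal sketch: LSA via Truncated SVD on TFIDF, then t-SNE for visualization.
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

queries = ["best mascara", "loreal mascara", "eyeshadow palette",
           "makeup tutorial", "contour kit", "sephora near me"]

X = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(queries)

svd = TruncatedSVD(n_components=min(30, X.shape[0] - 1), random_state=42)  # 30 in the full run
topics = svd.fit_transform(X)
print("Variance explained:", svd.explained_variance_ratio_.sum())

embedding = TSNE(n_components=2, perplexity=2, random_state=42).fit_transform(topics)
plt.scatter(embedding[:, 0], embedding[:, 1])
plt.show()
```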
It is impractical to inspect each of the 10 clusters in detail when each cluster is composed of thousands of search queries. So instead of manual inspection, I decided to layer in the grouping we have used in the past to classify these search queries based on consumers’ intent. This classification is the critical layer of the Intent Ontology that serves as the guiding principle moving forward and that is discussed in detail in its own section of the report. Figure 3 compares the previous t-SNE visualization by cluster with the manual coding of the search queries based on the Intent Ontology.

Figure 3. Topic clusters and Intent classification

There is not much correspondence between the two types of classification, so no clear insights can be drawn from the exercise, other than to note that these two classifications seem to be capturing very different aspects of the search queries, which is not surprising. It is important to note that the Intent classification is composed of 6 groups while the cluster analysis done so far is composed of 10 groups, so no direct correspondence is expected, but hopefully there is some type of relationship between the topic clusters and the intent clusters in the final solution.

Knowledge (context) transfer with pre-trained word embeddings
One of the problems with search queries is that they do not have a lot of context. Many times the search query is composed of just a few keywords, like “best makeup” or “best mascara drugstore”, instead of a full sentence or question. There are two challenges here:
1) there is little context from which to draw deeper semantic meaning, and 2) the language is not really “natural”; it is the way people have been trained to search by the search engines over the past 15 years. Given these challenges, I thought a good solution could be to transfer some of the missing context and semantic meaning by using pre-trained vectors. This is an ideal solution since the embeddings capture the context and semantic meaning of the words as used in rich and dense collections of text. I chose GloVe.6B.100d because of the nature of the corpus it was trained on, which is the entire Wikipedia corpus. This is preferable to the original Word2Vec trained on Google News because it is more likely to contain terms from my corpus such as product names like mascara, references to parts of the face such as eyelashes, or brand terms such as Sephora. Wikipedia is, in fact, a common source of knowledge and references for brands and products, so it seems ideal. The 100-dimension vectors were chosen for simplicity, but future exploration with other dimensions or with other pre-trained word embeddings like FastText is desirable to optimize the solution. Availability in multiple languages is key, since this is a solution that would ideally be used at different Performics offices around the globe. We are a global company with a network of offices around the world that rely on this type of analysis of search queries.

I used the GloVe pre-trained word embeddings to create search query vectors by averaging the vectors of the words in each search query. Only 17 search queries out of the 9,360 queries in the entire corpus did not contain any word included in the GloVe lexicon. These search queries were eliminated from further analysis, providing an effective sample of 9,343 search queries (documents). The same k-means clustering exercise performed before was repeated with the new pre-trained vectors. Figure 4 shows the result of the elbow method used to identify an optimal number of clusters. Although there is not a clear cut in the graph, the method suggests 5 as the optimal number of clusters. However, for ease of comparison with the method using the TFIDF matrix, I decided to go with 10 clusters for now.

Figure 4. Elbow method for identifying the optimal number of clusters with GloVe pre-trained vectors
To make a direct comparison with the TFIDF method used earlier, I extracted 1-gram and 2-gram terms for each cluster using CountVectorizer and counted the frequency of terms in each cluster. Table 4 shows the top 10 terms in each cluster. This is not exactly the same as with the TFIDF approach because the individual terms are not the ones used by the clustering algorithm, so these terms are not sorted by proximity to the cluster centroids but by their frequency within the cluster. Hopefully, these terms still represent the topic of each cluster. For completeness of the comparison, I estimated the size of each cluster and sorted the clusters from left to right.

Table 4. Top terms and size of clusters using pre-trained GloVe vectors

As can be seen, this method produces more balanced clusters: the first 4 contain between 12% and 17% of the search queries, the next 5 between 7% and 10%, and only the last cluster (Cluster 0) contains less than 5% of the search queries. This is a much more desirable solution. However, these sizes should not be taken too seriously, since the real size of these segments depends on the volume of searches each query gets in a time period, which can dramatically change the size of the segments. For instance, the top term Sephora by itself drove 39% of the search volume from the entire corpus in November 2017. This is a single term accounting for most of the demand for content in the makeup category as represented by our corpus, so wherever that term falls will dramatically increase the size of that cluster. For now, the variability in the size of the clusters in Table 4 is an indication of how well the method discriminates between the search queries, and it indicates that it is doing so well.

In terms of the meaning of each cluster, the single terms do capture some of it, but not completely. Cluster 1 seems to be about how to apply makeup (an important topic discovered during the manual classification). Cluster 8 is about makeup tutorials. Clusters 3 and 4 seem to be about foundation, but Cluster 3 is more about the type of skin while Cluster 4 is about reviews for specific brands of foundation. Cluster 6 is about eyeshadow palettes and lipstick, and Cluster 0 seems to be about reviews. This is both good and bad news. The clustering solution appears to capture nuances and differences between topics, like the ones speculated for Clusters 3 and 4, but it also shows how difficult it is to understand what each cluster is really about, since there are so many individual search queries in each and many share key terms
that it is hard to grasp the meaning of each cluster without a labor-intensive check of all the search queries in each group. For now, the focus is on the ability of the method to capture the inner structure of the data and provide a good starting point; the next couple of charts show precisely that.

Figure 5 repeats the analysis of using t-SNE to visualize the data by extracting the two main components and layering in the clusters as well as the intent classification. To make a clearer comparison with the method using the TFIDF matrix and Truncated SVD, I reduced the dimensionality of the data from 100 (the length of the GloVe vectors) to 30 by performing a PCA analysis and extracting the top 30 components. These components explain 78% of the variance, much better than the 8.9% from the Truncated SVD approach!

Figure 5. Topic clusters and Intent classification with GloVe pre-trained vectors and PCA-30

As can be seen, the search queries are now more tightly grouped by cluster and occupy distinct areas of the graph. There is some overlap between Clusters 8, 9 and 4 in the middle and between 5 and 7 at the top, but the clusters can easily be identified in different regions. Please note that the clustering was done with the full 100-dimensional vectors, not the reduced 30 principal components used to feed the t-SNE algorithm, so this could be refined further by clustering on the reduced data set. Something similar occurred with the intent classification (ontology). Most of the “Evaluating” group is in the top-right while the “Exploring Narrow” group is in the bottom-left. This distinction was not guaranteed, so the GloVe embeddings are certainly capturing semantic similarity as well as intent similarity. This is even more evident
with the “Understanding” group, whose search queries are mostly grouped together in the bottom-right corner. This is encouraging. There is more overlap and correspondence between the topic clusters and the intent groupings. For instance, Cluster 6 seems to fall within the “Understanding” group. Again, the hope is not to replace the Intent grouping with what is essentially a topic (semantic) grouping but to add another classification underneath it. Some correlation between topics and intents is expected, but one does not replace the other.