Leverage the voice of your customers.
An Anderson Analytics and SPSS predictive text analytics case study.
In the new information age, any customer can become a brand
enemy or evangelist and reach millions of other customers on the web.
These opinions, expressed freely on the internet, encompass attitudes,
thoughts and behaviors of past, current and potential customers.
Companies are rushing to tap into
the explosion of customer opinions
expressed online. The most innovative
companies know they could be even
more successful in meeting customer
needs, if they just understood them
better. Text analytics is proving to be
an invaluable tool in doing this.
Anderson Analytics, a full-service market research consultancy, tackles this issue using
cutting-edge text analytics and data mining software from SPSS that allows the application
of linguistic, statistical and pattern recognition techniques to extremely large text data sets.
A text analytics project is usually part of a much larger data mining project that would typically
involve the identification of some core strategic questions, the allocation of resources and
the eventual implementation of findings.
However, the focus of this case study is to describe the tactical aspects of a text analytics
project and to delineate the three basic steps involved in text analytics:
• Data Collection and Preparation
• Text Coding and Categorization
• Text Mining and Visualization
In this case study, Anderson Analytics “content-mined” data available on FlyerTalk.com’s
discussion boards. FlyerTalk.com is one of the most highly trafficked travel domains. It
features chat boards and discussions that cover the most up-to-date traveler information,
as well as loyalty programs for both airlines and hotels.
Note that the text analytics techniques applied in this case are not limited to discussion
boards or blogs but can be applied to any text data source, including survey open ends,
call center logs, customer complaint/suggestion databases, emails, etc.
Step 1: Data Collection & Preparation
Having quality data in the proper format is usually more than half of the battle for most researchers. For those who can gain direct access to a well-maintained customer database, the data collection and preparation process is relatively painless. However, for researchers who want to study text information that exists in a public forum such as FlyerTalk.com, data collection can be more complex and usually involves web-scraping.
Web scraping (or screen scraping) is a technique used to extract data from websites that display output generated from another program. There are many commercially available applications that can scrape a website and turn the blogs or forum messages into a data table.
“It is crucial for us to get an understanding of how our most loyal customers think and what they value most in their travel experience… any data that helps us meet our loyal customers and attend to needs of future customers is worth its weight in gold.”
Robin Korman, Vice President, Starwood Loyalty Marketing Program
Even with the availability of powerful web-scraping tools and techniques, text mining a popular blog or a message board like the one at FlyerTalk.com presents unique data collection and processing challenges. The amount of free text available on such sites usually prohibits an indiscriminate approach to data scraping. A strategy with clear objectives and a well-defined data extraction method are needed in order to increase the reliability of data analysis in the latter stages of the research.
In this particular case, researchers at Anderson Analytics narrowed the scope to just discussion topics within a 12-month period (from August 2005 to August 2006) on the five major forums intended for discussing the hotel loyalty programs of Starwood, Hilton, Marriott, InterContinental, and Hyatt hotels.
Specific web-scraping parameters differ depending on the structure of the target sites. In a discussion board format, the text data tend to follow a simple hierarchy. Typically, each forum contains a list of topics, and each topic consists of numerous posts. Therefore, the web-scraping process of FlyerTalk.com initially retrieves data such as the discussion topics, topic ID, topic starter, and topic start date. Then, by using the topic ID, the web-scraping application constructs and submits query strings to the FlyerTalk.com site to retrieve messages associated with each specific topic.
A good web-scraping tool should allow the capture of information that exists in the source data of an html page, not just the displayed text. Therefore, hidden information such as the topic ID, date stamp, etc. also becomes available to the researcher.
Besides making sure the fields in the final dataset are in the correct format, another problem unique to discussion board text needs to be addressed. It is very common for posters to quote others’ text within their own posts. These quotes should typically be extracted from the message field and placed in a separate field so as to prevent double counting and inadvertently weighting certain posts.
In addition to the text messages posted on the forum, the web-scraping process should also capture the poster’s ID and ‘handle’ as well as any other available poster information such as forum join date and forum registration information (in this case: location, frequent traveler program affiliation, etc.).
WEB SCRAPING PROCESS
• Crawl the website and scrape for topic, ID and thread initiator.
• DOWNLOAD: Use topic ID from the first step as part of the URL query string to download messages.
• STORE: Web crawl and store message display pages.
• Screen scrape stored webpages and extract data into a structured format.
• Link extracted posts with topics from the first step, along with other extracted fields to create the final dataset.
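As a rough illustration of two Step 1 ideas, retrieving messages by topic ID and keeping quoted text in a separate field, the following Python sketch may help. It is not the tool used in the study, which the case study does not name. The base URL, the `t` and `page` parameter names, and the assumption that quoted text is wrapped in `<blockquote>` elements are hypothetical placeholders rather than FlyerTalk.com's actual structure.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "https://forum.example.com/showthread.php"


def build_topic_url(topic_id, page=1):
    """Build the query string used to request one page of a topic's messages."""
    return BASE_URL + "?" + urlencode({"t": topic_id, "page": page})


class QuoteSplitter(HTMLParser):
    """Collect text inside <blockquote> separately from the poster's own text."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # nesting level of currently open <blockquote> tags
        self.own = []
        self.quoted = []

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            (self.quoted if self.depth else self.own).append(text)


def split_quotes(post_html):
    """Return (own_text, quoted_text) for one stored post body."""
    parser = QuoteSplitter()
    parser.feed(post_html)
    parser.close()
    return " ".join(parser.own), " ".join(parser.quoted)


if __name__ == "__main__":
    print(build_topic_url(12345))
    sample = "<blockquote>Breakfast was great.</blockquote><p>Agreed.</p>"
    print(split_quotes(sample))
```

Storing the two returned strings in separate dataset fields means a widely quoted post is counted once, under its original author, rather than once per reply.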
Step 2: Text Coding & Categorization
Text coding and categorization is the process of assigning each text data record a numeric value that can be used later for statistical analysis.
Text coding can apply either dichotomous codes (flags & many variables) or categorical codes (one variable for an entire dataset). Short answers to an open-ended survey question typically use categorical codes. However, the amount of text included in most discussion board posts typically requires dichotomous codes.
Text coding is usually an iterative process. This is particularly true for coding messages on a site such as FlyerTalk.com. This is because compared to survey answers, the text information from discussion boards tends to be less focused. The text data on most discussion boards tend to be “user-driven” rather than “provider-driven.”
Before creating categories, researchers at Anderson Analytics first randomly examine a sample of text messages to gain a basic understanding of the data. This step is required to understand the type of acronyms, shorthand and terminologies commonly used on the forum of interest.
“I strongly believe that the travel industry can not only learn by listening to their customers in real-time, but that by being active where the customers are on the Internet, they can create a unique generation of loyal, repeat customers… Customer service these days is where the customer is, not where you are”
Randy Petersen, CEO WebFlyer.com
SPSS Text Analysis for Surveys and Text Mining for Clementine are powerful tools. However, the text coding results can be greatly improved if the programs can be “trained” to better understand text information particular to the industry and topics of interest.
With a list of industry specific themes, concepts and words, the researchers at Anderson use tools such as SPSS Text Analysis for Surveys to create a custom dictionary. Then the SPSS text analytics applications can be used, in conjunction with an SPSS developed dictionary, to extract highly relevant concepts from the text data.
In this case, examples of some of the basic concepts in the messages that can be detected by the software include: ‘rates’, ‘stay’, ‘breakfast’, ‘points’, ‘free offers’. The text extraction and categorization processes are repeated with minor modification each time to fine tune results.
CODING & CATEGORIZATION
• Use both computer and human coder to obtain the preliminary understanding of the data.
• INITIAL CLASSIFICATION: Use SPSS Text Mining tool to perform initial categorization on a sample data set (1/100 of the entire dataset).
• COMPUTER CLASSIFICATION: Information and knowledge gained from the initial concept extraction is used by human coder to assist in computer categorization.
• CODING & CLASSIFICATION REFINEMENT: Categorization and coding are an iterative process. Custom libraries are created to refine the process. Text extraction is performed multiple times until the number of and the details of categories are satisfactory.
• CODING & EXTRACTION RULES: Once the coding result becomes satisfactory, the same coding and extraction rules are used on the entire dataset.
• CATEGORIZATION RESULTS: Categorization results are exported for further analysis with tools such as SPSS Text Analysis.
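To make the idea of dichotomous coding with a custom dictionary concrete, here is a minimal Python sketch. It is a simplified keyword matcher, not the linguistic extraction performed by the SPSS products, and the dictionary entries (including shorthand such as "bfast" and "pts") are hypothetical examples rather than the dictionary built in the study.

```python
import re

# Hypothetical custom dictionary: category -> terms, including forum shorthand.
# The terms are illustrative and are not the dictionary used in the study.
CUSTOM_DICTIONARY = {
    "rates": ["rate", "rates", "price"],
    "breakfast": ["breakfast", "bfast"],
    "points": ["points", "pts"],
    "free offers": ["free night", "free offer", "free offers"],
}


def compile_dictionary(dictionary):
    """Compile one case-insensitive whole-word pattern per category."""
    return {
        category: re.compile(
            r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
            re.IGNORECASE,
        )
        for category, terms in dictionary.items()
    }


def code_post(text, patterns):
    """Return dichotomous codes: one 0/1 flag per category for a single post."""
    return {
        category: int(bool(pattern.search(text)))
        for category, pattern in patterns.items()
    }


if __name__ == "__main__":
    patterns = compile_dictionary(CUSTOM_DICTIONARY)
    post = "Great rate, and the free night posted with 500 pts."
    print(code_post(post, patterns))
```

Each post yields one flag per category, so a long message touching several topics is coded on all of them. Refining the dictionary and re-running the coding on a sample mirrors the iterative loop described above.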
Step 3: Text Mining & Visualization
The coded text data can be interpreted in many different ways depending on the needs of any given research project. In this case, the data is examined via the following methods:
Positive/Negative comments and overlapping terms
The FlyerTalk.com data indicate that negative discussions among the posters are centered on the payment process, condition/quality of the bathroom, furniture, and the check in/out process. The praises seem to be centered around topics such as spa facility, complimentary breakfasts, points and promotions.
Data patterns within different hotel brands
By comparing the coded text data of Starwood and Hilton forums, the researchers find that the posters seem to be relatively more pleased with beds on Starwood’s board, but more pleased with food and health club facilities on Hilton’s board.
Longitudinal data patterns
As this study contains data from a one year period, data can be analyzed to understand how topics are being discussed on a month-to-month basis. The data in this particular case revealed that the discussion about “promotions” on the Starwood board was particularly frequent in February 2006. Cross-checking with Starwood management confirmed that special promotions were launched during that time period. This demonstrates one way to measure the impact of various communication strategies, promotions and even non-planned external events.
Analysis of Poster Groups
Web mining may be helpful in understanding the aggregate motivation of some of the most active users of the products. Though it may be difficult to segment posters with only one post, frequent posters can provide a relatively rich set of segmentation variables. In this case, some general motivational themes found were the need for being ‘in the know’, ‘finding deals’, and the desire to “give back”.
ANALYSIS: Super Posters – Connectors vs. Loyalists. Analysis using text Clementine’s Web Visualization Tool
“Understanding the needs of your customers as well as the strengths and weaknesses of your competitors is paramount for building brands that people love. Thus, I believe that the mining of customer comments on the web will become a cornerstone for future innovation and the detection of competitive threats.”
Isaac Collazo, Vice President, Performance Strategy & Planning, InterContinental Hotels Group
CONCLUSION
Companies have found that they can compete far more effectively if they gain a true, 360º view of their customers. The feedback that current and potential customers provide in blogs, forums and other online spaces provides a rich source of feedback. Using text analytics to monitor this information helps organizations gauge customer reaction to products and services and, when combined with analysis of “structured” transactional data, delivers predictive insight into customer behavior.
This paper described how text analytics was applied to information posted by users of travel and hospitality services; but the same techniques can be applied to other industries. A company might find, for example, that when it launches a special promotion, customers mention the offer frequently in their online posts. Text analytics can help identify this increase, as well as the ratio of positive/negative posts relating to the promotion. It can be a powerful validation tool to complement other primary and secondary customer research and feedback management initiatives.
Companies that improve their ability to navigate and mine the boards and blogs relevant to their industry are likely to gain a considerable information advantage over their competitors.
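The promotion example, tracking how often a topic is mentioned month by month and the ratio of positive to negative posts, can be sketched in a few lines of Python. The records below are made-up illustrations, not data from the study, and the tuple layout (post date, forum, category, sentiment) is an assumed export format rather than the output of any SPSS tool.

```python
from collections import Counter
from datetime import date

# Hypothetical coded records: (post date, forum, category, sentiment).
# These rows are made-up illustrations, not data from the study.
CODED_POSTS = [
    (date(2006, 1, 14), "Starwood", "promotions", "positive"),
    (date(2006, 2, 3), "Starwood", "promotions", "positive"),
    (date(2006, 2, 9), "Starwood", "promotions", "negative"),
    (date(2006, 2, 21), "Starwood", "promotions", "positive"),
    (date(2006, 2, 25), "Hilton", "promotions", "positive"),
]


def monthly_mentions(posts, forum, category):
    """Count posts per (year, month) for one forum and one coded category."""
    counts = Counter()
    for posted, post_forum, post_category, _ in posts:
        if post_forum == forum and post_category == category:
            counts[(posted.year, posted.month)] += 1
    return counts


def sentiment_ratio(posts, forum, category):
    """Return the positive/negative post ratio, or None if no negative posts."""
    tally = Counter(
        sentiment
        for _, post_forum, post_category, sentiment in posts
        if post_forum == forum and post_category == category
    )
    if tally["negative"] == 0:
        return None
    return tally["positive"] / tally["negative"]


if __name__ == "__main__":
    print(monthly_mentions(CODED_POSTS, "Starwood", "promotions"))
    print(sentiment_ratio(CODED_POSTS, "Starwood", "promotions"))
```

A spike in one month's count relative to its neighbors is the kind of pattern that could then be cross-checked against known promotion dates.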
Find out how you too can harness
the power of text analytics.
Contact Anderson Analytics & SPSS.
About Anderson Analytics
More than Market Research, Anderson Analytics is a next generation marketing consultancy that combines new
technologies, such as data and text mining, with traditional market research. We focus on helping clients Gain the
Information Advantage by combining the efficiencies and business experience found in large research firms with the
rigorous methodological understanding from academia and the creativity found only in smaller firms. Our clients put their
customers first, and so do we. Visit our website to learn about “The AA-Assurance.”
About SPSS Inc:
SPSS Inc. (Nasdaq: SPSS) is a leading global provider of predictive analytics software and solutions. The company’s
predictive analytics technology improves business processes by giving organizations forward visibility for decisions made
every day. By incorporating predictive analytics into their daily operations, organizations become Predictive Enterprises—
able to direct and automate decisions to meet business goals and achieve a measurable competitive advantage. More
than 250,000 public sector, academic, and commercial customers rely on SPSS technology to help increase revenue,
reduce costs, and detect and prevent fraud. Founded in 1968, SPSS is headquartered in Chicago, Illinois.
SPSS TEXT ANALYSIS FOR SURVEYS™
SPSS Text Analysis for Surveys uses natural language processing (NLP) software technologies and allows users to
combine automated and manual techniques in analyzing open-ended responses to survey questions. SPSS Text Analysis
for Surveys is uniquely able to distinguish between positive and negative comments and opinions, which is extremely
valuable in understanding customer feedback.
TEXT-MINING FOR CLEMENTINE®
Text Mining for Clementine enables users to extract key concepts, sentiments and relationships from textual or “unstructured”
data and convert them to a structured format that can be used to create predictive models. This has been shown
to improve the “lift” or accuracy of predictive data models and significantly improve results.
Founded in 1986, Frequent Flyer Services has created a unique niche for itself within the travel industry as a company
that conceives, develops and markets products and services exclusively for the frequent traveler. Its focus and distinctive
competency lie in the area of frequent traveler programs. Worldwide, these frequent traveler programs in the airline, hotel,
car rental and credit card industries have more than 75 million members who earn in excess of 650 billion miles per year.
Headed by Randy Petersen, whom The Wall Street Journal calls “...the most influential frequent flier in America,” Frequent
Flyer Services has had alliances with major companies such as AOL and provides, or has provided, content and information
to many of the leading sites on the Internet. The company is probably most famous for its Inside Flyer magazine, which has
grown from a simple newsletter into the leading publication in the world for members of frequent traveler programs.
Anderson Analytics, LLC
154 Cold Spring Road, Suite 80
Stamford, CT 06905
Tel: +1.888.891.3115
SPSS Inc.
233 S. Wacker Drive, 11th Floor
Chicago, IL 60606
Tel: +1.312.651.3000