MIS 329 Decision Support Systems
Assignment
Nationwide Insurance Used BI to Enhance Customer Service
Nationwide Mutual Insurance Company, headquartered in Columbus, Ohio, is one of the largest insurance and financial services companies, with $23 billion in revenues and more than $160 billion in statutory assets. It offers a comprehensive range of products through its family of 100-plus companies, with insurance products for auto, motorcycle, boat, life, homeowners, and farms. It also offers financial products and services including annuities, mortgages, mutual funds, pensions, and investment management.
Nationwide strives to achieve greater efficiency in all operations by managing its expenses along with its ability to grow its revenue. It recognizes that using its strategic asset of information, combined with analytics, allows it to outpace competitors in strategic and operational decision making, even in complex and unpredictable environments.
Historically, Nationwide's business units worked independently and with a lot of autonomy. This led to duplication of efforts, widely dissimilar data processing environments, and extreme data redundancy, resulting in higher expenses. The situation became more complicated whenever Nationwide pursued mergers or acquisitions.
Nationwide, using enterprise data warehouse technology from Teradata, set out to create, from the ground up, a single, authoritative environment for clean, consistent, and complete data that can be effectively used for best-practice analytics to make strategic and tactical business decisions in the areas of customer growth, retention, product profitability, cost containment, and productivity improvements. Nationwide transformed its siloed business units, which were supported by stove-piped data environments, into integrated units by using cutting-edge analytics that work with clear, consolidated data from all of its business units. The Teradata data warehouse at Nationwide has grown from 400 gigabytes to more than 100 terabytes and supports 85 percent of Nationwide's business with more than 2,500 users.
Integrated Customer Knowledge
Nationwide's Customer Knowledge Store (CKS) initiative developed a customer-centric database that integrated customer, product, and externally acquired data from more than 48 sources into a single customer data mart to deliver a holistic view of customers. This data mart was coupled with Teradata's customer relationship management application to create and manage effective customer marketing campaigns that use behavioral analysis of customer interactions to drive customer management actions (CMAs) for target segments. Nationwide added more sophisticated customer analytics that looked at customer portfolios and the effectiveness of various marketing campaigns. This data analysis helped Nationwide to initiate proactive customer communications around customer lifetime events like marriage, birth of a child, or home purchase, and had a significant impact on improving customer satisfaction. Also, by integrating customer contact history, product ownership, and payment information, Nationwide's behavioral analytics teams further created prioritized models that could identify which specific customer interaction was important for a customer at any given time. This resulted in a one-percentage-point improvement in customer retention rates and a significant improvement in customer enthusiasm scores. Nationwide also achieved 3 percent annual growth in incremental sales by using CKS. There are other uses of the customer database. In one of the initiatives, by integrating customer telephone data from multiple systems into CKS, the relationship managers at Nationwide try to be proactive in contacting customers in advance of a possible weather catastrophe, such as a hurricane or flood, to provide the primary policyholder information and explain the claims processes. These and other analytic insights now drive Nationwide to provide extremely personal customer service.
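The kind of prioritized, customer-centric view described above can be approximated by joining a handful of source extracts. The Python sketch below is purely illustrative and is not Nationwide's or Teradata's actual implementation; the file names, column names, and the simple priority score are all assumptions made for the example.

```python
import pandas as pd

# Illustrative source extracts (hypothetical file and column names).
contacts = pd.read_csv("contact_history.csv")    # customer_id, contact_date, channel
products = pd.read_csv("product_ownership.csv")  # customer_id, product_count
payments = pd.read_csv("payment_info.csv")       # customer_id, late_payments

# Consolidate the sources into a single customer-centric view.
contact_counts = (contacts.groupby("customer_id")
                  .size()
                  .rename("contact_count")
                  .reset_index())
view = (products.merge(payments, on="customer_id", how="outer")
                .merge(contact_counts, on="customer_id", how="left")
                .fillna(0))

# Toy prioritization score: more products and more contacts raise priority,
# late payments lower it (the weights are arbitrary, for illustration only).
view["priority"] = (0.5 * view["product_count"]
                    + 0.3 * view["contact_count"]
                    - 0.2 * view["late_payments"])
print(view.sort_values("priority", ascending=False).head())
```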
Financial Operations
A similar performance payoff from integrated information was also noted in financial operations. Nationwide's decentralized management style resulted in a fragmented financial reporting environment that included more than 14 general ledgers, 20 charts of accounts, 17 separate data repositories, 12 different reporting tools, and hundreds of thousands of spreadsheets. There was no common central view of the business, which resulted in labor-intensive, slow, and inaccurate reporting. About 75 percent of the effort was spent on acquiring, cleaning, consolidating, and validating the data, and very little time was spent on meaningful analysis of the data.
The Financial Performance Management initiative implemented
a new operating approach that worked on a single data and
technology architecture with a common set of systems
standardizing the process of reporting. It enabled Nationwide to
operate analytical centers of excellence with world-class
planning, capital management, risk assessment, and other
decision support capabilities that delivered timely, accurate, and
efficient accounting, reporting, and analytical services.
The data from more than 200 operational systems was sent to the enterprise-wide data warehouse and then distributed to various applications and analytics. This resulted in a 50 percent improvement in the monthly closing process, with closing intervals reduced from 14 days to 7 days.
Postmerger Data Integration
Nationwide's Goal State Rate Management initiative enabled the company to merge Allied Insurance's automobile policy system into its existing system. Both Nationwide and Allied source systems were custom-built applications that did not share any common values or process data in the same manner. Nationwide's IT department decided to bring all the data from the source systems into a centralized data warehouse, organized in an integrated fashion, which resulted in standard dimensional reporting and helped Nationwide in performing what-if analyses. The data analysis team could identify previously unknown differences in the data environment where premium rates were calculated differently on the Nationwide and Allied sides. Correcting all of these benefited Nationwide's policyholders because they were safeguarded from experiencing wide premium rate swings.
Enhanced Reporting
Nationwide's legacy reporting system, which catered to the
needs of property and casualty business units, took weeks to
compile and deliver the needed reports to the agents.
Nationwide determined that it needed better access to sales and
policy information to reach its sales targets. It chose a single
data warehouse approach and, after careful assessment of the
needs of sales management and individual agents, selected a
business intelligence platform that would integrate dynamic
enterprise dashboards into its reporting systems, making it easy
for the agents and associates to view policy information at a
glance. The new reporting system, dubbed Revenue Connection, also enabled users to analyze the information with extensive interactive and drill-down-to-details capabilities at various levels, which eliminated the need to generate custom ad hoc reports. Revenue Connection virtually eliminated requests for manual policy audits, resulting in huge savings in time and money for the business and technology teams. Reports were produced in 4 to 45 seconds rather than days or weeks, and productivity in some units improved by 20 to 30 percent.
Answer the following questions:
1. Why did Nationwide need an enterprise-wide data
warehouse?
2. How did integrated data drive the business value?
3. What forms of analytics are employed at Nationwide?
4. With integrated data available in an enterprise data
warehouse, what other applications could Nationwide
potentially develop?
Assignment Purpose:
The main purpose of the assignment is to enable students to describe the relationship between DW, BI, and DSS.
Assignment Guideline:
1. Use Times New Roman.
2. Use Font Size 12.
3. Use 1.15 Line Spacing.
4. Individual Assignment.
Substantive Ethics
I found this article by Mark S. Blodgett to be quite
refreshing and informative in terms of the new perspectives
being presented. In the article, the issue being presented is the
differences between ethics and law within corporate programs.
It is an interesting issue that not many seem to think about when
mentioning business rules and regulations. Moreover, ethics and law are typically viewed as two completely separate things, but the author disagrees. Blodgett believes that in order to better integrate ethical codes and legal terms into a corporation, both should be viewed as one and the same. This is a fair
point because as mentioned in the article, ethical codes are used
by more than 90% of companies today, yet law has not really
sunken into businesses as much as it should. Also, as mentioned
in the text, legal obligations can be easily ignored by business
executives simply because they are ignorant of the laws that are
proposed. This is another huge factor as to why laws and ethics
should be two sides of the same coin and not be viewed as
differences.
It must be mentioned that I do agree with the author and what the article's findings suggest. Both legal and ethical approaches should be taken when considering corporations and businesses in order to foster a more fluid and accommodating environment. Additionally, I can imagine this
study was a long and difficult one as over twenty different
compliance areas were assessed in order to compile an accurate
study. Not only that, but the term frequencies needed to be operationally defined correctly, which is no easy feat.
Overall, I feel this study is a very helpful and useful one
not only for corporate business, but for anyone in the
workplace. Legal obligations must be enforced but at the same
time ethical codes must be placed so that businesses may
prosper in a healthy way.
Justice at the Millennium
This article by Colquitt et al. was very interesting and
insightful on the topic of justice and fairness. Before reading
this study, I did not even consider what defines justice or how
fairness is accounted for. The authors are correct when stating
that we only judge something as just based on past research and
experiences and I found this quite interesting. Furthermore, the
authors found research studies dating back to 1975 up to the
date of publication in order to see just how much things have
changed in terms of the workplace. When considering this, it
was a great choice to conduct this study as a meta-analysis to
see the key differences between older definitions of justice, and
a modern take on the concept. Over this span of time, lots of
rules and regulations have been implemented into what defines
justice and more specifically, into the workplace. This study
mainly focuses on how justice today plays a role from an organizational point of view rather than in a courtroom, which is relevant to a lot more people. Rightfully so, the researchers
proposed three important questions to take into consideration
when analyzing all the different types of articles over the years.
Personally, I believe this meta-analysis is very important
for anyone in the workplace because the questions posed by the
researchers are prominent issues in today’s society. For
instance, an employee may have more than one boss and those
bosses may define fairness in differing ways. A study like this
may help both bosses come to a happy medium and decide whether what the employee has done is fair or not. Even more so,
thousands of new individuals are entering the workforce every
month, and with increasing demand for jobs come new accommodations for what is defined as just. Again, I cannot stress
enough how important this study is to those already in the
workplace or to those who are looking to make a change into
any work environment.
Journal of Intelligent & Fuzzy Systems 38 (2020) 6159–6173. DOI: 10.3233/JIFS-179698. IOS Press.
The impact of big data market segmentation using data mining and clustering techniques

Fahed Yoseph (a, b, *), Nurul Hashimah Ahamed Hassain Malim (b), Markku Heikkilä (c), Adrian Brezulianu (d), Oana Geman (e) and Nur Aqilah Paskhal Rostam (b)

(a) Faculty of Social Sciences, Business and Economics, Åbo Akademi University, Turku, Finland
(b) School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia
(c) Faculty of Social Sciences, Business and Economics, Åbo Akademi University, Turku, Finland
(d) Faculty of Electronics, Telecommunications and Information Technology, Gheorghe Asachi Technical University, Iaşi, Romania
(e) Department of Health and Human Development, Stefan cel Mare University, Suceava, Romania
Abstract. Targeted marketing strategy is a prominent topic that has received substantial attention from both industry and academia. Market segmentation is a widely used approach for investigating the heterogeneity of customer buying behavior and profitability. It is important to note that conventional market segmentation models in the retail industry are predominantly descriptive methods, lack sufficient market insights, and often fail to identify sufficiently small segments. This study also takes advantage of the Hadoop distributed file system for its ability to process vast datasets. Three different market segmentation experiments, using a modified best-fit regression together with the Expectation-Maximization (EM) and K-Means++ clustering algorithms, were conducted and subsequently assessed using cluster quality assessment. The results of this research are twofold: i) insight into the customer purchase behavior revealed for each Customer Lifetime Value (CLTV) segment; ii) the performance of the clustering algorithms in producing accurate market segments. The analysis indicated that the average lifetime of the customer was only two years, and the churn rate was 52%. Consequently, a marketing strategy was devised based on these results and implemented on the department store's sales. The marketing records revealed that the sales growth rate increased from 5% to 9%.
Keywords: Market segmentation, data mining, customer lifetime
value (CLTV), RFM model (recency frequency monetary)
*Corresponding author: Fahed Yoseph, Faculty of Social Sciences, Business and Economics, Åbo Akademi University, Turku, Finland, and School of Computer Sciences, Universiti Sains Malaysia, 11800, Penang, Malaysia. E-mail: [email protected]

1. Introduction

The retail industry collects enormous volumes of POS data. However, this raw POS data has minimal use if it is not properly processed to generate retail insights, optimize marketing efforts, and drive decisions. The retailer's relationship with customers is the key to brand loyalty, repeat store visits, and ultimately, sales conversions. This relationship has been affected by recent economic and social changes. The retail industry is prompted to be more strategic in its planning and to develop a deep understanding of its consumers as well as its competitors. Understanding customers' behavior as well as establishing a loyal relationship with customers has become the central concern and strategic goal for most retailers [1] interested in tracking and managing their customer lifetime value on a systematic basis [44]. Market segmentation is the process of dividing the market base
of potential customers into similar or homogeneous groups or segments that possess mutual characteristics, which helps marketers to gather individuals with similar choices and interests [2]. This enables retailers to avoid selling unprofitable and irrelevant products with regard to their marketing purpose, which results in better management of the available resources through the selection of a suitable market segment and a primary focus on specific promising segments [3–15].
Furthermore, as far as the research scope is concerned, there have been a number of studies that examine customer purchase behavior and lifetime value across different products based on a variety of market segmentations with demographic variables and characteristics. Instead of addressing individual consumers based on their purchasing behavior, most market segmentation studies merely considered an overview of consumers' historical data to produce assumptions about what makes consumers similar to one another. It is significant to highlight that this method hides critical facts about individual consumers.
Among the customer lifetime value models, a highly regarded model cited by many experts is the Pareto/NBD "Counting Your Customers" model proposed by Schmittlein, Morrison, and Colombo (1987). The model investigates customer purchase behavior in settings where customer dropout is unobserved. Although the model is powerful for analyzing customer purchase behavior, it has proven to be empirically complex to implement due to its computational challenges, and only a handful of studies claim to have implemented it [44].
Based on previous studies of market segmentation in the retail domain, Recency, Frequency, and Monetary (RFM) analysis has been extensively employed, as this model can divide customers into groups and therefore enables retailers to decide how to fully utilize their limited resources in providing effective customer service through the categorization of customers. Nonetheless, RFM also has its own limitations [4], as it only focuses on customers' best scores and provides less meaningful scoring on recency, frequency, and monetary for most consumers (Wei, Lin, and Wu, 2010). Moreover, RFM analysis is not able to prospect for new customers, as it mainly concerns the organization's current customers [6], and it is not considered a precise quantitative analysis model, since the importance of each RFM measure differs between industries [16–20]. The current research proposes an enhanced, user-friendly market segmentation modeling method, which is more advanced and effective than the conventional RFM method. The integration of Customer Lifetime Value and the newly proposed RFM variants (PQ) and (T) into a closed-loop model represents different variations in customer purchase behavior. The enhanced model has the capability to simultaneously analyze millions of raw POS records and identify groups of customers by criteria the retailer may never have considered. This goldmine of knowledge is expected to help marketers avoid assumptions when doing customer deep-dives and trend analysis, which subsequently equips marketers to devise targeted marketing campaigns resulting in sales growth and higher ROI. The RFMPQ and RFMT datasets concentrate on the idea of identifying the purchasing power history of an individual customer or segment. The P variable represents the average purchasing power per customer over all transactions, the Q variable represents the average purchasing power per product, and T represents the change of consumer buying behavior, or trend, using a change rate. The enhanced RFM model also incorporates CLTV for predicting future cash flow attributed to the customer's shopping period with the retailer [8], followed by applying a modified best-fit regression technique, and the K-means++ and Expectation-Maximization (EM) clustering algorithms, to analyze customer buying behavior as well as to assess each clustering technique's performance using cluster quality assessment. The analysis can also identify marketers' areas of focus and ensure the highest quality of customer service.
2. Market segmentation
Market segmentation is the process of categorizing
large homogenous market into similar or homoge-
neous smaller groups who share characteristics such
as income, shopping habits, lifestyle, age, and per-
sonality traits [9]. These segments are relevant to
marketing and sales and can be used to optimize
products, customer service, and advertising to different consumers [6]. It is seen that many companies
across the retail industry have identified customer
service as a market key differentiator and tend to
segment their customers for positive customer expe-
rience and service delivery [10]. There are three
types of market segmentation bases, namely demo-
graphic, geographic, and behavioral. Demographic
segmentation is the most commonly used variable
when segmenting a market. It has the ability to
provide the retailer with a clear vision of future advertising plans and a precise customer shopping profile, and it focuses on the measurable factors
of consumers and their households. Furthermore, this
segment is primarily descriptive in terms of gender,
race, age, income, lifestyle, and family status [11].
In contrast, behavioral segmentation divides cus-
tomers based on their attitude toward products. Many
marketers consider behavioral variables such as occa-
sions, benefits, usage rate, customer status, readiness,
loyalty status, and attitude towards a product as the
ideal starting points for creating market segmentation
[11]. In behavioral segmentation, consumers are segmented based on their evaluation and buying activities, as well as their use and disposal of goods, in order to recognize consumer needs. These criteria can provide a thorough understanding of consumer behavior, as they draw on social psychology, anthropology, economics, sociology, and psychology, which influence consumers in their purchasing decisions [21–25]. To get a sense of the overall customer lifetime value for the customer base, [45] proposed a framework to integrate the customers' distribution with iso-value curves, grouping customers on the basis of RFM characteristics in order to understand the factors that trigger consumers' defection. [48] proposed an analytical model for consumer engagement, related to the subsequent stages of the consumer life-cycle such as customer development, customer acquisition, and customer retention. The authors concluded that the availability of data is vital to the development of advanced analysis at each consumer stage. However, several organizational issues of analytics for consumer engagement remain, which constitute barriers to implementing analytics for customer engagement.
In order to address consumer behavior that evolves over time, this research examines behavioral and demographic segmentation models and identifies customer behavior using the Customer Lifetime Value (CLTV) and the Recency, Frequency, and Monetary (RFM) models.
2.1. Customer life time value (CLTV) model
Customer Lifetime Value (CLTV) is an important
metric to measure the total worth or profit to a busi-
ness obtained from a customer over the whole period
of their relationship with the retailer [8]. The literature defines customer churn as the extinction of the contract between the firm and the customer, whereas customer retention refers to the collection of activities organizations undertake to reduce the number of customer defections.
Churn rate and retention rate are critical metrics for any company and are considered primary components of the future CLTV, where CLTV is an estimation of the average profit a customer is expected to generate before he or she churns [48]. The concept of retention and churn is often correlated with the industry life-cycle. When the industry is in the growth phase of its life-cycle, sales increase exponentially. However, customer churn is the most challenging task for the retail industry. In this perspective, more insight is needed to know the reasons for customer churn in a dynamic industry.
The three main components of CLTV are customer
acquisition, customer expansion, and customer reten-
tion [46]. Nevertheless, it is crucial to consider COGS
(Cost of Goods Sold) and acquisition cost to square
off the real CLTV. The basic model to calculate CLTV
is presented in Equation (1).
CLTV = Σ_{t=1}^{n} p_t · r^t / (1 + d)^t    (1)
The above CLTV formula is essentially a proxy for an average customer who stays for X periods of time and pays Y total amount of money. Here t represents a specific period of time: (t = 1) represents the first year and (t = 2) denotes the second year. The n represents the total number of periods the customer will stay with the retailer before churn occurs, and r represents the month-over-month retention rate. p_t is the profit that the customer will contribute or generate for the retailer in period t, and finally, d refers to the churn rate. Additionally, the customer's loyalty can be calculated using the retention rate formula, as illustrated in Equation (2). In the retention rate formula, CE denotes the number of customers at the end of each time period, CN is the total number of new customers acquired in the chosen time period, and CS denotes the number of customers at the start of the time period.
Retention rate = ((CE − CN) / CS) × 100    (2)
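As a concrete reading of Equations (1) and (2), the short Python sketch below computes CLTV and the retention rate from the quantities defined above, following the paper's definitions (r as the retention rate and d as the rate in the (1 + d)^t denominator). The numeric values are made up for illustration.

```python
def cltv(profits, r, d):
    """Equation (1): CLTV = sum over t of p_t * r**t / (1 + d)**t.

    profits -- list of per-period profits p_t, for t = 1..n
    r       -- period-over-period retention rate
    d       -- rate used in the (1 + d)**t denominator (churn rate in the paper)
    """
    return sum(p * (r ** t) / ((1 + d) ** t)
               for t, p in enumerate(profits, start=1))

def retention_rate(ce, cn, cs):
    """Equation (2): ((CE - CN) / CS) * 100."""
    return (ce - cn) / cs * 100.0

# Hypothetical example values.
print(cltv(profits=[120.0, 110.0, 95.0], r=0.48, d=0.52))
print(retention_rate(ce=3100, cn=600, cs=3220))
```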
Management of consumer retention requires tools that allow decision-makers to assess the risk of each consumer defecting and to understand the factors that trigger consumers' defection [47]. A customer retention strategy, also known as a loyalty rate, is the collection of activities a retailer uses to maintain a long-term relationship by engaging existing customers, increasing profitability by increasing the number of repeat customers.
Table 1
Criteria of customers in each segment

Segment     Criteria of customers
Best        Average purchase amount > total average purchase amount; average purchase frequency > total average frequency
Spender     Average purchase amount > total average purchase amount; average purchase frequency < total average frequency
Frequent    Average purchase amount < total average purchase amount; average purchase frequency > total average frequency
Uncertain   Average purchase amount < total average purchase amount; average purchase frequency < total average frequency
CLTV represents a greater improvement compared to traditional RFM analysis, as the frequency of a customer's purchases and the amount of the customer's average purchase are used for segmenting the customer base. The CLTV matrix classifies customer purchase behavior into different segments, namely Best, Spender, Frequent, and Uncertain, as classified by Marcus (1998). Table 1 illustrates the criteria of each segment as classified by Marcus (1998).
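The segment criteria in Table 1 translate directly into two comparisons per customer. Below is a minimal pandas sketch of that classification, assuming a hypothetical DataFrame of per-customer average purchase amount and frequency; it illustrates Marcus's (1998) criteria as summarized above and is not the authors' PL/SQL implementation.

```python
import pandas as pd

# Hypothetical per-customer aggregates.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "avg_amount": [250.0, 300.0, 40.0, 30.0],
    "avg_frequency": [12, 2, 15, 1],
})

amount_cutoff = df["avg_amount"].mean()        # total average purchase amount
frequency_cutoff = df["avg_frequency"].mean()  # total average frequency

def segment(row):
    high_amount = row["avg_amount"] > amount_cutoff
    high_frequency = row["avg_frequency"] > frequency_cutoff
    if high_amount and high_frequency:
        return "Best"
    if high_amount:
        return "Spender"
    if high_frequency:
        return "Frequent"
    return "Uncertain"

df["segment"] = df.apply(segment, axis=1)
print(df)
```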
2.2. Recency, frequency, and monetary (RFM)
model
RFM is a standard statistical marketing model for customer behavior segmentation that assesses consumer lifetime value. The model is very popular in the retail industry, as it groups customers based on their shopping power history: how recently, how often, and how much the customer bought. The RFM model helps retailers group customers into various segments or categories to identify customers who are more likely to respond to marketing promotions and future customer personalization services [17]. R symbolizes recency and refers to the time interval since the customer's last purchase. F symbolizes the frequency of consumer behavior in a time period, and M symbolizes monetary, referring to the amount of money spent in a period [18].
Quintile scoring is the most commonly used scoring in the RFM method, arranging customers in ascending or descending order (best to worst). Customers are grouped into five equal groups, where the best group receives the highest score (5) and the worst receives the lowest score (1) [1]. The RFM score is the weighted average of its individual components and is calculated as portrayed in equations (3) and (4) to derive a continuous RFM score. Finally, these scores can be rescaled to the 0–1 range [17].
RFM score = (recency score × recency weight) + (frequency score × frequency weight) + (monetary score × monetary weight)    (3)

Rescaled RFM score = (RFM score − minimum RFM score) / (maximum RFM score − minimum RFM score)    (4)
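The quintile scoring and Equations (3) and (4) can be sketched in a few lines of pandas. The example frame and weights below are assumptions for illustration (the study itself used equal weights); pd.qcut assigns the 1–5 quintile labels, with recency negated so that more recent purchases score higher.

```python
import pandas as pd

# Hypothetical per-customer RFM inputs.
rfm = pd.DataFrame({
    "recency_days": [5, 40, 200, 15, 90, 300, 7, 60, 120, 20],
    "frequency": [20, 5, 1, 12, 3, 1, 25, 6, 2, 10],
    "monetary": [900, 150, 30, 600, 80, 25, 1200, 200, 50, 450],
})

# Quintile scores 1..5 (lower recency = better, so the sign is inverted).
rfm["R"] = pd.qcut(-rfm["recency_days"], 5, labels=[1, 2, 3, 4, 5]).astype(int)
rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)
rfm["M"] = pd.qcut(rfm["monetary"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)

# Equation (3): weighted combination of the components (equal weights here).
w_r, w_f, w_m = 1.0, 1.0, 1.0
rfm["rfm_score"] = w_r * rfm["R"] + w_f * rfm["F"] + w_m * rfm["M"]

# Equation (4): rescale to the 0-1 range.
lo, hi = rfm["rfm_score"].min(), rfm["rfm_score"].max()
rfm["rfm_rescaled"] = (rfm["rfm_score"] - lo) / (hi - lo)
print(rfm)
```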
2.3. Market segmentation using data mining,
RFM, CLTV models and clustering
techniques
Market segmentation helps to differentiate and cus-
tomize marketing strategies into segments. Market
Segmentation is a significant key in data mining,
where data mining is used to interrogate segmenta-
tion data to create data-driven behavioral information
segments that are applied to detect meaningful pat-
terns and rules underlying consumer behavior [19].
Furthermore, [26] and [27] were among the stud-
ies that performed market segmentation using data
mining, RFM, CLTV, and clustering technique to
form a decision-making system. [28] proposed clustering and profiling of customers using customer relationship management (CRM) and RFM for recommendations. On the other hand, data mining was conducted on historical customer sales data using the RFM model with the K-Means algorithm, where the results outlined recommendations for a customer relationship strategy. Also, f (2016) proposed a three-dimensional market segmentation model based on customer lifetime value, customer activity, and customer satisfaction. For more accuracy, the author grouped customers into several different groups; the corresponding variables were obtained from the RFM, Kano, and BG/NBD models.
Furthermore, the market segmentation model helps
enterprises to maximize their profits. In [29–35], cus-
tomers were classified into various clusters using
RFM technique and association rules were mined to
identify high-profit customers. RFM statistical Tech-
niques and Clustering methods for Customer Value
Analysis were combined in [26] for a company’s
online selling. As far as the methodology is con-
cerned, there is no standard convention for measuring customer purchase behavior, as each study differs in how it examines customer purchase behavior. Nevertheless, the K-Means algorithm is noted as an extensively used clustering algorithm in previous research due to its simplicity and speed in working with a large amount of data [36]. Despite its strength, it has been recorded that the K-Means method uses a random distribution for the seeding positions and does not produce the same result each time it is run [37]. A new, improved K-means algorithm called K-means++ uses a more sophisticated seeding procedure for the initial choice of the center positions and is often twice as fast as standard k-means. Contrastingly, Neha,
Kirti, and Kanika (2012) noted that K-means and
(EM) Expectation-Maximization algorithms are the
two most commonly used based algorithms for the
identification of growth patterns. Even though EM
is similar to the K-Means algorithm, this algorithm
is based on two different steps iterated until there
are no more changes in the current hypothesis [29].
Expectation (E) refers to computing the probability
that each datum is a member of each class. Maximiza-
tion (M) refers to altering the parameters of each class
to maximize those probabilities. Eventually, they con-
verge, although not necessarily correct. Furthermore,
EM algorithm is embedded with a significant feature
where it can be applied to problems with observed
data that provide “partial” information only [30].
Based on several comparative studies of EM and K-
Means methods [31–34], it was observed that EM
outperformed K-Means and results were improved
when they were hybridized. The current study inte-
grates two dynamics models, namely CLTV and RFM
models, with the addition of new RFM variants, i.e.,
P, Q and T to cater the weakness and inaccuracy of
consumer modeling that are caused by the limita-
tions RFM. In addition, this study applies K-means++
and Expectation Maximization (EM) clustering algo-
rithms to offer the retail industry with effective analy-
sis of customer buying behavior through the combina-
tion of customer profitability and product profitability
in creating a strategic marketing campaign as
explained previously in the introduction [38–40].
2.4. Mining big data
Data mining is the process of extracting information from large data sets and transforming it into an understandable form for further use. Data mining can be used in cases where the database is large and the classification of the data is difficult [35]. The term Big Data is often used for very large databases whose size ranges from terabytes to many petabytes and which are beyond the ability of commonly used relational database management systems (RDBMS) to process within a tolerable elapsed time. Patel, Birla, & Nair (2012) carried out extensive experiments on the big data problem, resulting in the adoption of the Hadoop Distributed File System (HDFS) for storage and the MapReduce method for parallel processing of large volumes of data. However, research on Big Data analysis using data mining, especially with clustering methods, is still considered young and therefore attracts many researchers to conduct further research in this potential area [37, 38]. One such study proposed a fast parallel k-means clustering algorithm based on MapReduce, which has been widely embraced by both academia and industry; speed-up, scale-up, and size-up were used to evaluate the performance of the proposed algorithm. The findings showed that the proposed model could process very large datasets on commodity (low-cost) hardware effectively.
Hadoop is becoming a commodity for every data-driven organization; where data is larger and comes in many formats, mining and extracting intelligence from data has always been a challenge [39]. The new dynamics in databases have brought new challenges to the current analytical models and traditional databases and emphasize the need for a paradigm shift in data extraction and data analysis. Among such challenges are the performance of data retrieval and the variety of data sources, for which the format of relational databases may no longer be the best option. [39] stated that traditional database systems fall short in handling scalability to boost performance efficiency and in dealing with Big Data effectively, and thus the adoption of systems such as Hadoop is increasing. Hadoop is an open-source framework for data-intensive distributed processing of large-scale data, based on the MapReduce programming model and a distributed file system called the Hadoop Distributed File System (HDFS).
The MapReduce programming model is a methodology for implementing and generating large datasets, making Hadoop the preferred solution to the problems of traditional data mining [41].
The main components of Hadoop are the Hadoop Distributed File System (HDFS), a high-bandwidth clustered storage layer that allows writing applications that rapidly process massive data in parallel, which is vital for large files, and MapReduce, which is the heart of Hadoop. While HDFS provides high-bandwidth clustered storage, MapReduce processes enormous pieces of data by dividing the input dataset into independent smaller pieces that are distributed among multiple machines, referred to as nodes, to process them in parallel [42].
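To make the MapReduce model concrete, the sketch below shows a Hadoop Streaming-style mapper and reducer (plain Python reading stdin and writing tab-separated key/value pairs) that totals sales per customer from POS lines. It is a generic illustration of the programming model, not the pipeline used in this study, and the input record format is assumed.

```python
import sys

def mapper():
    # Input lines assumed as: customer_id,store_code,item_code,quantity,total
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if len(fields) < 5:
            continue  # skip malformed records
        customer_id, total = fields[0], fields[4]
        print(f"{customer_id}\t{total}")  # emit key<TAB>value

def reducer():
    # Hadoop delivers mapper output grouped (sorted) by key.
    current_key, running_total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{running_total}")
            current_key, running_total = key, 0.0
        running_total += float(value)
    if current_key is not None:
        print(f"{current_key}\t{running_total}")

if __name__ == "__main__":
    # Run as: pos_sales.py map   or   pos_sales.py reduce
    mapper() if sys.argv[1] == "map" else reducer()
```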
3. Research method
The methodology of this work, which involved five phases, is outlined in Fig. 1. The first phase focused on the implementation of our POS database on a single-node Hadoop distributed file system. The experiments were performed on the following systems: a single node using Java (hardware: Intel Core i5, 8 GB RAM, 2.4 GHz CPU; software: Java with JDK 1.8) and a Hadoop implementation (Ubuntu 16.10, Java with JDK 1.8, Hadoop 2.7.0). The second phase focused on Extract, Transform, and Load (ETL), involving the data preprocessing steps, i.e., data cleansing, feature selection, and data transformation. Since the dataset used in the current research was different from those in the existing literature, a controlled experiment was performed in which the work of [20] was replicated as the baseline of this research.
The hybrid approach (RFMPQ & CLTV) was
included in the third and fourth phase, where differ-
ent methods were employed stepwise. As the data was
transformed into three different variants, i.e., RFM,
PQ, and T, the first processing step differed in one of
them.
The classification was used to categorize cus-
tomer purchase behavior into CLTV matrix based
on the RFM and PQ dataset, while modified best-fit
regression was performed on the T dataset to find
the customer purchase trend (curve). Even though [20] only employed the K-means technique, the present experiment was extended to include the use of K-means++ and EM.
Subsequently, the outputs were fed into the clus-
tering algorithms, i.e., K-means++ and EM at the
fourth phase for further demographical segmentation.
The accuracy of these clustering algorithms was mea-
sured during the final phase using the cluster quality
assessment that was introduced by Draghici & Kuklin
(2003). Additionally, the retention rate was calcu-
lated, and human judgment was also included as a
measure of the effectiveness of this method for a
marketing campaign.
Fig. 1. Methodology for the hybrid classification and clustering of market segmentation.

The methodology for this research is illustrated in Fig. 1.
3.1. Proposed data transformation using RFM
model, RFMPQ, RFMT
In this study, Apache Hadoop was used on a single-node Hadoop cluster; Ubuntu Linux 12.04 64-bit Server Edition was preferred as the operating system, and KVM (Kernel-based Virtual Machine) was selected as the virtualization environment. The Hadoop (HDFS) node was accessed via Secure Shell (SSH). No parameter or optimization adjustment was made to the operating system to improve performance; this type of Hadoop implementation serves the purpose and is sufficient to have a running Hadoop environment in which to conduct our experiment. The market segmentation model uses retail POS data acquired from a medium-sized retailer in the State of Kuwait. The POS data contains three years (2012–2015) of customers' initial and repeated purchases, made at different geographical branches. Each transaction represents a product purchased, with each line consisting of a cashier number, store code, item code, brand code, product quantity sold, product price, date and time of the transaction, sub-total, and grand total, as well as the customer's demographic information.
Since the data was in separate compilations based on the stores' geographic locations, a common data format with consistent definitions for more descriptive keys and fields was developed to merge the information. In this phase, the string variables were converted to numeric variables, and subsequently, missing values were checked and replaced manually with default or mean values. In this research, the PQ variants are used to describe customers' purchasing power in different demographic and behavioral eras and customers' attraction to a specific product and service. For the T variant, customers from the Best segment are used to identify the customer purchase curve, and we calibrate the market segmentation model using repeated transactions for 3,220 customers over a two-year period.
The age attribute was grouped into six ranges (i.e., ages 1–17, 18–24, 25–34, 35–44, 45–54, and above 55). The age group analysis was based on the premise that a typical customer's needs change as they age. The customer's age was classified into six categories, where each category was identified using a unique number: Category 1 = (1–17), Category 2 = (18–24), Category 3 = (25–34), Category 4 = (35–44), Category 5 = (45–54), and senior Category 6 = (55+). The gender attribute was encoded as 1 for male customers, 2 for female customers, and 3 for companies. Furthermore, using the demographic concept-hierarchy method, attributes such as city and country were replaced by the higher-level concept nationality. Citizens of Middle Eastern nationalities, Asian nationalities, the USA, and Canada were each assigned a unique numeric (binary) value. Other nationalities were grouped by continent, namely Europeans and Africans. One exception was made for British nationals due to their high volume of purchases. To ensure the maximum accuracy of the RFM scores, the values of five dimension attributes from the POS data were necessary.
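A minimal pandas sketch of the demographic encoding described above is shown below. The column names and the input frame are hypothetical, and the study itself implemented this step in PL/SQL; the sketch only mirrors the age-category, gender-code, and missing-value rules stated in the text.

```python
import pandas as pd

pos = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "age": [29, 51, None],
    "gender": ["Male", "Female", "Company"],
})

# Replace missing ages with the mean, as in the preprocessing step.
pos["age"] = pos["age"].fillna(pos["age"].mean())

# Six age categories used in the study.
age_bins = [0, 17, 24, 34, 44, 54, 200]
pos["age_category"] = pd.cut(pos["age"], bins=age_bins, labels=[1, 2, 3, 4, 5, 6]).astype(int)

# Gender encoding: 1 = male, 2 = female, 3 = company.
pos["gender_code"] = pos["gender"].map({"Male": 1, "Female": 2, "Company": 3})
print(pos)
```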
The next step involved the calculation of the RFM scores as well as the newly proposed variants PQ and T. The implementation of CLTV, retention rate, RFMPQ, and T was developed using the PL/SQL programming language. It must be noted that the RFMPQ score refers to the weighted average of its individual components, in which the scoring analysis typically involves grouping customers into buckets (quantiles) of equal size. As far as this study is concerned, the grouping procedure was applied independently to the five RFMPQ component measures.
The attributes are described in Table 2 as follows:

Table 2
Attributes of RFMPQ

CUSTOMER ID      Customer unique identifier used to capture customer-related information.
TRX DATE         Transaction date used to capture the customer's Recency (R).
TXH COUNT        Number of transactions used to capture the Frequency (F) of transactions made by a customer.
TRX TOTAL SALE   Total amount of each transaction used to capture the Monetary (M) value generated by the customer.
TRX UNIT PRICE   Average purchase power (monetary) used to capture the Average Monetary (P) per customer.
TRX QTY          Average purchase power (quantity) used to capture the Average Items (Q) purchased per customer.
Customers were grouped according to the respec-
tive measure into classes of equal sizes. The derived
R, F, M, P, and Q groups became the components
for the RFMPQ cell assignment. RFMPQ groups are
aggregated, with appropriate client-defined weights,
and the scores were the weighted average of its com-
ponents.
In the next step, the values and scores of the RFMPQ variables were determined and used as inputs to the clustering algorithms. In accordance with this research and the client's requests, the RFMPQ variables had equal weights (1:1:1).
The second variant proposed in this research was the T variable, which represents the trend of customer purchase behavior using a change rate. This study proposes a combination of two analytical data mining steps. To find T, the change in the consumer purchase behavior trend, an enhanced best-fit regression algorithm is used. The T dataset is then fed into the unsupervised clustering algorithms to split the consumers into different groups based on pattern dissimilarities. The T variable should answer a very important question: whether the consumer is at high risk of shifting his or her business to another retailer. One of the most common indicators of high-risk consumers is a drop-off in purchasing power and a decrease in visits. A major limitation of market segmentation and RFM models is that they ignore the behavioral changes of consumers during the time period of analysis. Although the recency variable acts as one indicator of such consumer behavior, it is affected by customers' transient behavior and is based only on the last purchase date. Therefore, introducing a new analysis variable was essential for the retailer to narrow down high-risk consumers.
From the perspective of a retailer, average values change according to customers' purchases and their satisfaction with the retailer in terms of price and services. If the average purchase amounts of a customer decrease continuously, it can be concluded that this customer is, or was, on the verge of shifting his or her business to another retailer. A similar conclusion can be derived for a customer with an increasing average purchase value, showing that the customer is very profitable and should be ranked accordingly. Therefore, customers with such variation must be treated accordingly. A computable parameter was developed through some advanced programming. First, the purchase amounts of all customers in each selected time period of analysis were required.
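The trend idea can be illustrated with an ordinary least-squares line fitted to a customer's per-period purchase amounts: a clearly negative slope flags a potential high-risk (churn-prone) customer, while a positive slope flags a growing one. This is only a plain linear-regression stand-in for the paper's enhanced best-fit regression, with made-up numbers.

```python
import numpy as np

def purchase_trend(amounts):
    """Fit amount = a*t + b over consecutive periods and return the slope a."""
    t = np.arange(1, len(amounts) + 1)
    slope, _intercept = np.polyfit(t, amounts, deg=1)
    return slope

# Hypothetical quarterly purchase amounts for two customers.
declining = [320.0, 280.0, 210.0, 150.0]   # negative slope -> high churn risk
growing = [90.0, 120.0, 160.0, 210.0]      # positive slope -> increasingly profitable

for label, series in [("declining", declining), ("growing", growing)]:
    s = purchase_trend(series)
    risk = "high-risk" if s < 0 else "profitable/growing"
    print(f"{label}: slope={s:.1f} -> {risk}")
```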
3.2. Market segmentation using K-means++ and
EM clustering methods
The drawback of the classic k-means algorithm is that the user needs to define the initial centroid points, and the algorithm offers no accuracy guarantees. This becomes more critical when clustering documents, because each center point is represented by a word and the distance calculation between words is not a trivial task. To overcome this problem, k-means++ was introduced in order to find good initial center points. K-means++ is a simple probabilistic means of initializing k-means clustering that not only has the best known theoretical guarantees on the expected outcome quality but also works very well in practice. According to [43–51], the k-means++ algorithm is a variation of the k-means algorithm in which a new approach to selecting initial cluster centers, using random starting centers with specific probabilities, is employed. The steps used in this algorithm are described below. In this regard,
the essential component required is the preserva-
tion of the diversity of seeds while ensuring that
the outliers remain robust. The primary concern of
the k-means problem is to identify cluster centers
that minimize intra-class variance by reducing the
distances from each clustered data point. This can
be achieved through an effective and well-designed
cluster-initialization technique. k-means++ was pro-
posed in 2007 by [42] for choosing initial values
(seeds) for the classic k-means clustering algorithm
to avoid poor results found by the k-means clustering
algorithm. K-means++ algorithm initializes means
more intelligently so that there is a distribution of
cluster means that is roughly even relative to the data.
K-means++ accomplishes this by selecting the first
cluster center at random and then drawing the next
sample from a distribution that puts a heavy proba-
bility weight where there are data and no close-by
cluster center. The execution of the K-Means++ and EM algorithms is carried out using the WEKA tools, and the generated results are then exported into Excel for easy comparative analysis. We evaluate the performance of the market basket based on the mean calculated across three years of forecast customer sales transactions. The K-means++ algorithm is explained below. The K-means++ algorithm comes with a theoretical guarantee to find a solution that is O(log k) competitive to the optimal k-means solution. It is also fairly simple to describe and implement.
The Expectation-Maximization (EM) algorithm is an iterative estimation algorithm, a method similar to the K-Means algorithm, introduced by Dempster, Laird, & Rubin (1977). The Expectation-Maximization algorithm is an important statistical tool and a powerful method for obtaining maximum likelihood estimates of the parameters of an underlying distribution when the data contain null and missing values, in order to generate an accurate hypothesis. There are three steps involved in the EM technique, the first being the EM clustering initialization. Every class j of the M classes (clusters) is formed by a parameter vector (θ), composed of two parameters, the mean (μj) and the covariance matrix (Pj), which define the Gaussian (normal) probability distribution used as features to describe or classify the observed and unobserved entities of the data set x. The Expectation-Maximization (EM) algorithm aims to approximate the parameter vector (θ) of the real distribution. Cluster convergence is the third step, following the M-step: after every iteration, a convergence check is conducted to verify whether the difference between the attribute vector of an iteration and that of the previous iteration is smaller than an acceptable error tolerance given by a parameter.
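The study ran K-Means++ and EM in WEKA; a roughly equivalent sketch in Python using scikit-learn is shown below for readers who want to reproduce the idea, using KMeans with init="k-means++" and GaussianMixture (which is fitted by EM). The feature columns and data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical customer features: [rescaled RFM score, age category, gender code]
X = np.column_stack([
    rng.random(500),
    rng.integers(1, 7, 500),
    rng.integers(1, 4, 500),
]).astype(float)
X = StandardScaler().fit_transform(X)

# K-Means with the k-means++ seeding procedure.
kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0)
km_labels = kmeans.fit_predict(X)

# Gaussian mixture model fitted with the EM algorithm.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
em_labels = gmm.fit_predict(X)

print(np.bincount(km_labels), np.bincount(em_labels))
```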
3.3. Quality assessment
One type of cluster quality assessment, as suggested in [41], was performed in this study: comparing the size (diameter) of each cluster against the distance to the nearest cluster. This process can also be understood in terms of the distance between members of each cluster and
the center of the cluster, and the diameter (size) of the smallest sphere containing the cluster. If the inter-cluster distance is much larger than the size of the clusters, then the clustering algorithm is considered trustworthy. The formula is min(Dij) / max(di), where Dij is the distance between cluster i and cluster j, with both i and j ranging from 1 to 6, and di is the diameter.

Fig. 2. Cluster quality assessed by the ratio of the distance to the nearest cluster to the cluster diameter; source: Draghici (2003).
Figure 2 shows that the quality can be assessed simply by looking at the cluster diameter. A cluster can be formed heuristically even when there is no similarity between the clustered patterns; this occurs because the algorithm forces K clusters to be created.
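A small sketch of this assessment is given below, under the assumption that cluster diameter is taken as the maximum pairwise distance within a cluster and inter-cluster distance as the distance between cluster centroids; a ratio well above 1 suggests well-separated clusters. It uses NumPy and SciPy and synthetic data.

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist

def quality_ratio(X, labels):
    """min inter-cluster (centroid) distance / max cluster diameter."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    centroids = np.array([c.mean(axis=0) for c in clusters])
    diameters = [pdist(c).max() if len(c) > 1 else 0.0 for c in clusters]
    inter = cdist(centroids, centroids)
    inter[inter == 0] = np.inf  # ignore self-distances
    return inter.min() / max(diameters)

# Two well-separated synthetic clusters -> ratio much greater than 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
print(quality_ratio(X, labels))
```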
4. Results and discussion on RFM and
RFMPQ dataset
The main objective driven by the customer pyramid
classification is to determine the segment in which
customers spend more with the retailer over a period
of time and also the segment that is less costly to
maintain. The distribution of customers after using
CLTV matrix on the RFMPQ dataset is illustrated
in Fig. 3 with respect to their frequency and average
monetary values. The findings generated by customer
value value matrix classification revealed that 7,024 customers were categorized under the Best segment, 25,153 customers fell under the Spender segment, 39,107 customers were grouped in the Frequent segment, and 39,624 customers were classified in the Uncertain segment. These findings indicated that the customer base included both limited-budget and high-budget shoppers.
Fig. 3. Effective customer pyramid classification.

Based on the results in Fig. 4, it can be observed from the analysis of the K-means++ and EM clustering algorithms that both algorithms agree on the gender and age segments. Cluster 1 is the most beneficial segment of customers. The best spenders in cluster 1 are predominantly females from the age groups (25–34) and (35–44). The cluster quality assessment of the RFM dataset, shown in Table 3, clearly shows that the K-means++ algorithm, with a value of 328.4529657 compared with 271.6329114 for the EM algorithm, is far more accurate, because it has the larger value of inter-cluster distance divided by the size of the cluster.
Since classification according to the CLTV matrix was already applied prior to clustering, the discussion of the results of this experiment is divided into subsections according to the quadrants of the CLTV matrix.
The first section will elaborate on the demographic
clustering for gender and age, where summaries for
both gender and age are provided for each segment.
The following section will discuss the cluster aver-
age shopping (visits) frequency, the cluster average
monetary, and the cluster average spending per visit.
4.1. RFMPQ results using K-means++ and EM
on best, spender, frequent, and uncertain
segments
The results in Fig. 5 show the summary for all four segments, namely Best, Spender, Frequent, and Uncertain, where most customers were female, along with their average age per cluster. In general, K-means++ is seen as the best clustering method except in the Best segment, where the accuracy of the two algorithms is very close due to the smaller number of customers. Each cluster will be discussed in further detail with respect to nationality in the following sections.
Figure 6 shows the four clusters for the Uncertain segment as generated by K-means++ and EM clustering. The total number of customers in this segment is 39,624. The analysis with both clustering algorithms showed that cluster 1 was the most beneficial cluster (segment) because
it was superior to the other clusters in terms of the demographic variables.

Fig. 4. The average of the four clusters generated by the K-means++ algorithm and the EM algorithm for gender and age.
The average age results showed that customers between the ages of 45 and 54 were the most beneficial customers, and the average gender indicated that most customers were male. In terms of average nationality, citizens of the Philippines were considered the best customers in average monetary value when using K-Means++, while citizens of India were the best customers based on EM clustering.
4.2. Summary of the market segmentation on
RFMPQ dataset
With regards to the Market Segmentation on
RFMPQ dataset, it can be summarized that Best seg-
ment was the most profitable segment with its Total
Average Spending Per Segment of (1109.37) com-
pared to other segments. Cluster 2 was the most
profitable cluster in the Best segment with most
female customers and the age groups of (25–34) and
(35–44). Meanwhile, the result for average national-
ity portrayed that citizens of Kuwait and UAE were
the best customers in average monetary and it was dis-
covered that both K-means++ and EM showed similar
accuracy. The Spender segment was the second most
profitable segment with the Total Average Spending
Per Segment of (834.01). Cluster 4 of this segment
was the most profitable cluster with an average mon-
etary of (924.19).
Female customers, those between the ages of
(25–34) and (35–44) as well as the citizens of Kuwait
and Qatar were the best customers, and it was also
shown that K-Means++ algorithm was the most accu-
rate. Next, the Frequent segment was the third most
profitable segment with the Total Average Spending
Per Segment of (349.57), and it was noted that cluster
1 was the most profitable cluster with an average mon-
etary of (397.81). Female customers, the age group of
(35–44) and citizens of the UAE and Bahrain were the
best customers. K-means++ algorithm was regarded
as the most accurate algorithm for this cluster in Fre-
quent segment.
The Uncertain segment was the least profitable segment, with a Total Average Spending Per Segment of (168.59). Cluster 1 was the least profitable cluster, with an average monetary value of (239.14). Female customers, those in the age group of (45–54), and citizens of the Philippines and some Arab nationalities were the best customers, and the K-means++ algorithm was the most accurate algorithm for this cluster in the Uncertain segment. Based on the comprehensive information extracted from the RFMPQ dataset, it is concluded that (a) the newly proposed PQ method can fulfill both robust classification and robust segmentation for the market segmentation model, even on a dataset that contains noisy data, and (b) PQ is the most important variable and has also proven to be a crucial parameter for clustering.
Fig. 5. Summary for all segments in demographic clustering (Gender).
Fig. 6. Four clusters generated by K-Means++ and EM clustering for the Uncertain segment.
Fig. 7. The result from the strategies implemented using the ranking matrix.

4.3. Predictive customer lifetime value

The predictive CLTV matrix projects a new customer's expenses over their entire lifetime with the retailer. Most retailers measure CLTV as the amount, in dollars or in the retailer's local currency, spent by the customer over the whole relationship, i.e., from the first to the last transaction. Nevertheless, it is essential to account for COGS (Cost of Goods Sold) and acquisition cost to square off the real CLTV. Predictive CLTV, the estimation of the customer's future value, also considers the customer's expected lifetime and the potential monetary value of new purchases. Furthermore, based on the client's request, the static discount rate was set at 25%, which is deducted from total sales. The acquisition cost, also set based on the client's request, was 400, the estimated figure for how much is spent to attract each new customer.
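As an illustration of how these figures could enter the basic discounted CLTV calculation introduced earlier, the following is a hedged sketch (our own, with assumed per-period profits and retention rate) that applies a 25% discount rate and subtracts a 400 acquisition cost.

# Illustrative sketch only: discounted CLTV with acquisition cost.
# The per-period profits and retention rate are assumed for the example;
# the paper's settings are a 25% discount rate and a 400 acquisition cost.
def discounted_cltv(profits, retention_rate, discount_rate, acquisition_cost):
    """Retention-weighted, discounted per-period profits minus acquisition cost."""
    value = 0.0
    for t, profit in enumerate(profits, start=1):
        value += (retention_rate ** t) * profit / (1.0 + discount_rate) ** t
    return value - acquisition_cost

# Example: a customer expected to generate 900 per year over a two-year lifetime.
print(discounted_cltv(profits=[900.0, 900.0],
                      retention_rate=0.48,      # 1 - 52% churn
                      discount_rate=0.25,
                      acquisition_cost=400.0))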
4.4. Customer lifetime value ranking matrix

CLTV ranking groups customers into quadrants according to their profitability and retention propensity, and it can help marketers perform more effective market segmentation [17]. This is accomplished by combining the RFMPQ and RFMT values with the CLTV to rank total sales quarterly and yearly using the ranking matrix analytical function, as shown in Fig. 7.
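A minimal sketch of the quadrant idea follows (our own, not the paper's exact ranking function): split customers at the median of an estimated CLTV and of a retention-propensity score, which yields four rank groups. The input table and its values are assumptions for the example.

# Illustrative sketch only: assign customers to four quadrants by median splits
# of estimated CLTV (profitability) and retention propensity.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "cltv": [1200.0, 150.0, 700.0, 90.0, 980.0, 310.0],
    "retention_propensity": [0.81, 0.35, 0.72, 0.22, 0.40, 0.66],
})

high_value = customers["cltv"] >= customers["cltv"].median()
high_retention = (customers["retention_propensity"]
                  >= customers["retention_propensity"].median())

quadrant = pd.Series("Q4: low value / low retention", index=customers.index)
quadrant[high_value & high_retention] = "Q1: high value / high retention"
quadrant[high_value & ~high_retention] = "Q2: high value / low retention"
quadrant[~high_value & high_retention] = "Q3: low value / high retention"
customers["quadrant"] = quadrant

print(customers[["customer_id", "quadrant"]])

The paper's ranking matrix additionally incorporates the RFMPQ and RFMT values and quarterly and yearly totals; the median split above is only the simplest form of the quadrant idea.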
The experimental results for CLTV using the ranking matrix, presented in Fig. 7, reflect the success of the market segmentation data mining model. Note that the test POS dataset covers 2012 to 2015. The client had noticed a drop in sales and a decline in foot traffic in the second half of 2011. Development of our market segmentation model started in 2013, and the model was implemented in the fourth quarter of 2014. The analysis indicated that the client lost customers and sales dropped immediately in the first quarter of 2015 as a direct result of the shift from a mass marketing strategy to a targeted marketing strategy based on market segmentation. This is considered a healthy outcome of moving from mass marketing to more customized segmentation. From the second quarter onwards, the client gained a small increase in sales and retained a new group of customers. The results also show that the retailer turned profitable starting in the third quarter, with strong growth in the fourth quarter of 2015.
Consistent results were observed in each quarter of 2015. The client thus strengthened its market position and slowed the effects of a rapidly weakening overall retail market. The analysis showed that the average customer lifetime is only two years and the churn rate is 52% (see Table 6), which prompted the development of a comprehensive marketing strategy. The results show that combining the RFMPQ model with the CLTV ranking matrix was the best method for categorizing and classifying current and potential consumer segments by long-term profitability. Each market segment will be further investigated and analyzed to provide specific characterizations, a better understanding and identification of opportunities, and an assessment of profitability and risk, since each segment no longer encompasses the entire market. Based on the segments identified, the retailer should create a tailored campaign for each segment and offer distinct services to these segments in order to maximize responses according to each segment's behavior. Finally, based on the experimental results, it can be concluded that the newly proposed RFM (PQ) and (T) variants were far more efficient than the traditional RFM model in classifying and clustering customer value segments via the RFM attributes. Note that this work was implemented in 2015; owing to various circumstances, however, documentation and publication were delayed for some time.
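As a brief sanity check (our own note, under a constant-churn assumption the paper does not state explicitly), the reported churn rate and average lifetime are consistent: with a churn probability c per year, the expected lifetime is

E[\text{lifetime}] = \sum_{t=1}^{\infty} t \, c \, (1-c)^{t-1} = \frac{1}{c} = \frac{1}{0.52} \approx 1.9 \ \text{years}.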
5. Conclusions
This study has introduced a new technique for CLTV and proposed three new RFM variants: P, Q, and T. First, the new variants offer several advantages over the standard RFM model: the RFMPQ method takes into consideration changes in a customer's average shopping power over time as well as the average purchase power of the products that customer buys over time. The proposed PQ variables provide key attributes describing customer purchase behavior across different demographic groups. The P variant helps identify which segment of customers spends more over a period of time and costs less to maintain. The Q variant helps identify a customer's attraction to a specific product or service, and also helps identify best-selling products, which increases sales potential in terms of the number of units of best-selling products that can be sold.
The market segments were constructed using several clustering algorithms, namely K-Means++ and EM. To convert this idea into a computable dynamic parameter, a newly modified best-fit regression algorithm and the RFMT variable were proposed. The modified regression algorithm demonstrated a high degree of accuracy in strengthening and reinforcing the effect of the customer's most recent purchases (T) while preserving the importance of the customer's previous purchases. It can be further extended to highlight key customer trends using new demographic variables such as profession, income level, spending methods, and online spending. The results of applying the new RFM PQ and T variations reflect their effectiveness for market segmentation and their ability to offer the retail industry intelligent analysis that combines customer and product profitability. This type of analysis can also identify the areas on which marketers should focus their attention while ensuring the highest level of quality and customer service. Additionally, these models help retailers identify VIP consumers who are on the verge of taking their business to a competitor. Products that are highly profitable and purchased by the most profitable segments can also be easily identified. Both the analysis results and the marketing experts confirmed that classifying customer purchase behavior using the CLTV matrix against the RFMPQ dataset revealed the most accurate and crucial information about customer purchase behavior.
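To illustrate what the P and Q attributes could look like in practice, here is a minimal sketch (our own reading of the description above, not the authors' code) that derives a per-customer average spend per transaction (P) and a per-product average spend per unit (Q) from raw POS rows; the table and column names are assumptions.

# Illustrative sketch only: derive P- and Q-style features from POS transactions.
# Column names (customer_id, product_id, quantity, amount) are assumptions.
import pandas as pd

pos = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "product_id":  ["A", "B", "A", "C", "C", "B"],
    "quantity":    [2, 1, 1, 3, 2, 1],
    "amount":      [40.0, 15.0, 22.0, 90.0, 60.0, 18.0],
})

# P: average purchasing power per customer (average spend per transaction).
p = pos.groupby("customer_id")["amount"].mean().rename("p")

# Q: average purchasing power per product (average spend per unit sold).
q = (pos.groupby("product_id")["amount"].sum()
     / pos.groupby("product_id")["quantity"].sum()).rename("q")

print(p)
print(q)

In the full model these quantities would presumably be computed per time window so as to capture the changes over time described above.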
Furthermore, the results indicated that the age and gender variables produced accurate results, with an estimated analysis accuracy of 75%. The nationality variable, however, showed low accuracy with both algorithms, possibly due to missing and noisy data in that field. Applying the new RFM PQ and T variations also proved effective in practice, contributing to a sales growth rate of up to 6% for market segmentation and offering the SMR industry intelligent analysis that combines customer and product profitability.
Secondly, retail is a data-driven industry that processes large volumes of POS transactions, and the ever-growing collection of data is often seen as a bottleneck for big data analysis. Many retailers face the challenge of keeping their data on a platform that gives them a single, consistent view. Although Hadoop is not the main focus of this study, its capabilities were investigated here: the Hadoop implementation provided highly scalable data distribution and fast data analysis performance.
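As a hedged illustration of how POS aggregation can be distributed on Hadoop (a generic Hadoop Streaming pattern, not the configuration used in this study), a mapper and reducer written in Python could total spend per customer across a large transaction file.

# Illustrative Hadoop Streaming sketch only: total spend per customer.
# In practice the mapper and reducer would be two separate scripts, each
# reading lines from stdin and writing tab-separated key/value pairs to stdout.
import sys

def mapper():
    # Input lines assumed to be CSV: customer_id,product_id,quantity,amount
    for line in sys.stdin:
        parts = line.strip().split(",")
        if len(parts) == 4:
            customer_id, _, _, amount = parts
            print(f"{customer_id}\t{amount}")

def reducer():
    # Hadoop delivers the mapper output sorted by key (customer_id).
    current_key, total = None, 0.0
    for line in sys.stdin:
        key, value = line.strip().split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{total:.2f}")
            current_key, total = key, 0.0
        total += float(value)
    if current_key is not None:
        print(f"{current_key}\t{total:.2f}")

These two functions would typically be submitted as separate scripts via the Hadoop Streaming jar; the exact invocation depends on the cluster setup.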
Finally, it has been shown that sophisticated statistical modeling methods can provide useful information for experts, but they are costly and complex to implement and are likely to pose a challenge to the implementation of marketing strategies. This study proposed a simple yet powerful approach to market segmentation. The results show that the model provides easy-to-implement and affordable market segmentation methodologies that deliver substantial value relative to the effort involved. The analysis results for the Best, Spender, Frequent, and Uncertain segments gave the client a guided vision of how to calculate customer lifetime value and retention.
A recommendation for further research is to examine the implementation of large POS datasets on multiple clusters and to scale the cluster by adding extra nodes across several inexpensive servers. Finally, ignoring consumer purchase behavior trends can be disastrous. Further research is also needed on more exploratory data mining models, such as Market Basket Analysis, to help identify what the profile of the future customer might look like from a product perspective.
Funding
This research received no external funding.
Acknowledgments
We are grateful for the support given by Universiti Sains Malaysia, where Y.A.F. completed his Master's degree on this research under the supervision of N.H.A.H.M.
References
[1] A.B. Patel, M. Birla and U. Nair, Addressing Big
Data Problem Using Hadoop and Map Reduce, in 2012
Nirma University International Conference on Engineering
(NUiCONE) (2012), pp. 1–5.
[2] A. Chorianopoulos and K. Tsiptsis, Data Mining Techniques
in CRM: Inside Customer Segmentation, no. 21. John Wiley
& Sons., 2010.
[3] A. David and S. Vassilvitskii, k-means++: The advantages
of careful seeding, Proceedings of the eighteenth annual
ACM-SIAM symposium on Discrete algorithms. Society for
Industrial and Applied Mathematics, 2007.
[4] A.P. Dempster, N.M. Laird and D.B. Rubin, Maximum
Likelihood from Incomplete Data via the EM Algorithm,
J of the R Stat Soc Ser B 39(1) (1977), 1–38.
[5] A. Kak, Expectation-Maximization Algorithm for Cluster-
ing Multidimensional Numerical Data, Purdue University,
2017.
[6] A. Neha, A. Kirti and G. Kanika, Comparative Analysis of
k-means and Enhanced K-means clustering algorithm for
data mining, Int J Sci Eng Res 3(3), 2012.
[7] B. Sohrabi and A. Khanlari, Customer Lifetime Value (CLV)
Measurement Based on RFM Model, Iran Account Audit
Rev 14(47) (2007), 7–20.
[8] B. Tammo HA, et al., Analytics for customer engagement,
Journal of Service Research 13(3) (2010), 341–356.
[9] C.C. Aggarwal, Data Mining: The textbook. New York,
USA: Springer, 2015.
[10] C. Bauckhage, Lecture Notes on Data Science : k-Means
Clustering Is Gaussian Mixture Modeling, B-IT, University
of Bonn We, 2015.
[11] C. Marcus, A Practical Yet Meaningful Approach to
Customer Segmentation, J Consum Mark 15(5) (1998),
494–504.
[12] D. Arthur and S. Vassilvitskii, k-means ++ : The Advan-
tages of Careful Seeding, in SODA ’07 Proceedings of the
eighteenth annual ACM-SIAM symposium on Discrete algo-
rithms, (2007), pp. 1027–1035.
[13] D. Birant, Data Mining Using RFM Analysis Knowledge-
Oriented Applications in Data Mining, Prof Kimi InTech,
2011.
[14] D.R. Kishor and N.B. Venkateswarlu, Hybridization of
Expectation-Maximization and K-Means Algorithms for
Better Clustering Performance, Cybern Inf Technol 16(2)
(2016), 16–34.
[15] F. Jiashuang, Y. Suihuai, Y. Mingjiu, et al., Optimal selection of design scheme in cloud environment: A novel hybrid approach of multi-criteria decision-making based on F-ANP and F-QFD, Journal of Intelligent & Fuzzy Systems, pre-press (2019), 1–18. DOI: 10.3233/JIFS-190630.
[16] F. Peter, S. Bruce, G.S. Hardie and K. Lok Lee, Count-
ing your customers the easy way: An alternative to
the Pareto/NBD model, Marketing Science 24(2) (2005),
275–284.
[17] G. Lefait and K. Tahar, Customer Segmentation Archi-
tecture Based on Clustering Techniques, in 2010 Fourth
International Conference on Digital Society, (2010), pp.
243–248.
[18] G. Sunil, C.F. Mela and J.M. Vidal-Sanz, The Value of A
‘Free’ Customer, 2006.
[19] I. He and C. Li, The Research and Application of Customer
Segmentation on E-commerce Websites, in 2016 6th Inter-
national Conference on Digital Home (ICDH), (2016), pp.
203–208.
[20] I. Kolyshkina, E. Nankani, S.J. Simoff and S.M. Denize,
RDMA Aware Networks Programming User Manual, in
Australian & New Zealand Marketing Academy. Confer-
ence (ANZMAC), (2010), pp. 1–7.
[21] I. Yeh, K. Yang and T. Ting, Knowledge Discovery on RFM Model using Bernoulli Sequence, Expert Syst Appl 36(3) (2009), 5866–5871.
[22] J.A. McCarty and M. Hastak, Segmentation Approaches in
Data-Mining: A Comparison of RFM, CHAID, and Logistic
Regression, J Bus Res 60 (2007), pp. 656–662.
[23] J.L. Schafer, Analysis of Incomplete Multivariate Data, First edition. USA: CRC Press LLC, 1997.
[24] J. Maryani and D. Riana, Clustering and Profiling of
Customers Using RFM For Customer Relationship Manage-
ment Recommendations, in 5th International Conference
on Cyber and IT Service Management (CITSM), (2017), pp.
1–7.
[25] J.R. Miglautsch, Thoughts on RFM scoring, J Database
Mark Cust Strateg Manag 8(1) (2000), 67–72.
[26] J. Wei, S. Lin and H. Wu, A review of the application of RFM model, African J Bus Manag 4(19) (2010), 4199–4206.
[27] K. Dipanjan, S. Garla and C. Goutam, Comparison of Probabilistic-D and k-Means Clustering in Segment Profiles for B2B Markets, in SAS Global Forum, Las Vegas, NV, USA, April 2011.
[28] K.R. Kashwan and C.M. Velu, Customer Segmentation
Using Clustering and Data Mining Techniques, Int J Comput
Theory Eng 5(6), 2013.
[29] M. Cleveland, N. Papadopoulos and M. Laroche, Identity,
Demographics and Consumer Behaviors International mar-
ket Segmentation Across Product Categories, Int Mark Rev
28(3) (2011), 244–266.
[30] M. Khajvand, Z. Kiyana, A. Sarah and A. Somayeh, Estimating Customer Lifetime Value Based on RFM Analysis of Customer Purchase Behavior: Case Study, Procedia Comput Sci 3 (2011), 57–63.
[31] M. Rezaeinia and R. Rahmani, Recommender System
Based on Customer Segmentation (RSCS), Kybernetes
45(7) (2016), 1129–1157.
[32] M. Safari, Customer Lifetime Value to Managing Marketing Strategies in the Financial Services, Int Lett Soc Humanist Sci 42 (2015), 164–173.
[33] N. Li, L. Zeng, Q. He and Z. Shi, Parallel Implementation
of Apriori Algorithm Based on MapReduce, in 2012 13th
ACIS International Conference on Software Engineering,
Artificial Intelligence, Networking and Parallel/Distributed
Computing, (2012), pp. 236–241.
[34] O. Geman, et al., book chapter: From Fuzzy Expert Sys-
tem to Artificial Neural Network: Application to Assisted
Speech Therapy, INTECH, Artificial Neural Networks.
Models and Applications, Edited by Joao Luis Garcia Rosa,
Published: October 19th 2016, DOI: 10.5772/61493, ISBN:
978-953-51-2705, 2016.
[35] O. Geman, et al., book chapter Mathematical Models Used
in Intelligent Assistive Technologies: Response Surface
Methodology in Software Tools Optimization for Medical
Rehabilitation, Intel Syst Ref Library, Vol. 170, Hari-
ton Costin et al. (Eds): Recent Advances in Intelligent
Assistive Technologies: Paradigms and Applications, 978-
3-030-30816-2, 463829 1 En, (4), 2019
[36] P.K. Srimani and M.S. Koti, Evaluation of Principal Com-
ponents Analysis (PCA) and Data Clustering Technique
(DCT) on Medical Data, Int J Knowl Eng 3(2) (2012),
202–206.
[37] P. Mishra and S. Dash, Developing RFM Model for Cus-
tomer Segmentation in Retail Industry, Int J Mark Hum
Resour Manag 1(1) (2010), 58–69.
[38] R.A.I.T. Daoud, B. Belaid, A. Abdellah and L. Rachid,
Combining RFM Model and Clustering Techniques for
Customer Value Analysis of a Company selling online,
in 2015 IEEE/ACS 12th International Conference of
Computer Systems and Applications (AICCSA), 2015,
pp. 1–6.
[39] R. Soudagar, Customer Segmentation and Strategy Defini-
tion in Segments Case Study: An Internet Service Provider
in Iran, Lulea University of technology Department, 2012.
[40] R. Sethy and M. Panda, Big Data Analysis using Hadoop: A Survey, Int J Adv Res Comput Sci Softw Eng 5(7), 2015.
[41] S. Chen, Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce, in Proceedings of the VLDB Endowment, (2010), pp. 1459–1468.
[42] S. Draghici and A. Kuklin, Data Analysis Tools for DNA Microarrays, 1st edition. CRC Press, 2003.
[43] S. Mohammad, S. Hosseini, A. Maleki and M.R. Gho-
lamian, Cluster Analysis using Data Mining Approach to
Develop CRM Methodology to Assess the Customer Loy-
alty, Expert Syst Appl 37(2010) (2010), 5259–5264.
[44] S. Zahra, M.A. Ghazanfar, A. Khalid, M.A. Azam, U.
Naeem and A. Prugel-bennett, Novel Centroid Selection
Approaches for KMeans-Clustering Based Recommender
Systems, Inf Sci (Ny) 320 (2015), 156–189.
[45] T.K. Hun and R. Yazdanifard, The Impact of Proper Marketing Communication Channels on Consumer's Behavior and Segmentation Consumers, Asian J Bus Manag 2(2) (2014), 155–159.
[46] T. White, Hadoop: The Definitive Guide, Third Edit.
Sebastopol, United States: O’Reilly Media, Inc, USA, 2012.
[47] T. Upadhyay, V. Atma and D. Vishal, Customer Profiling and Segmentation using Data Mining Techniques, Int J Comput Sci Commun 7 (2016), 65–67.
[48] V. Vandenbulcke, F. Lecron, C. Ducarroz and F. Fouss,
Customer Segmentation Based On a Collaborative Recom-
mendation System : Application to A Mass Retail Company,
in Proceedings of the 42nd Annual Conference of the Euro-
pean Marketing Academy, 2013, pp. 1–7.
[49] V.S. Jose, M. Carl, F. Mela and S. Gupta, The value of a free customer, No. wb092903, Universidad Carlos III de Madrid, Departamento de Economía de la Empresa, 2009.
[50] W. Dong, Y. Cai and T. Kwok Leung, Making EEMD more
effective in extracting bearing fault features for intelligent
bearing fault diagnosis by using blind fault component sep-
aration, DOI: 10.3233/JIFS-169523, Journal: Journal of
Intelligent & Fuzzy Systems 34(6) (2018), 3429–3441.
[51] W. Zhao, H. Ma and Q. He, Parallel K -Means Clustering
Based on MapReduce, in In IEEE International Conference
on Cloud Computing, (2009), pp. 674–679.
MIS 329 Decision Support Systems Assignment Nationwide Insur

More Related Content

More from IlonaThornburg83

Once you have identified two questions that interest you, conduct an.docx
Once you have identified two questions that interest you, conduct an.docxOnce you have identified two questions that interest you, conduct an.docx
Once you have identified two questions that interest you, conduct an.docxIlonaThornburg83
 
On December 31, 2015, Ms. Levine CPA, your manager and the treasurer.docx
On December 31, 2015, Ms. Levine CPA, your manager and the treasurer.docxOn December 31, 2015, Ms. Levine CPA, your manager and the treasurer.docx
On December 31, 2015, Ms. Levine CPA, your manager and the treasurer.docxIlonaThornburg83
 
On Dumpster Diving” by Lars Eighner (50 Essays, p. 139-15.docx
On Dumpster Diving” by Lars Eighner (50 Essays, p. 139-15.docxOn Dumpster Diving” by Lars Eighner (50 Essays, p. 139-15.docx
On Dumpster Diving” by Lars Eighner (50 Essays, p. 139-15.docxIlonaThornburg83
 
Ok so I have done all the calculations, graphs and interpritations m.docx
Ok so I have done all the calculations, graphs and interpritations m.docxOk so I have done all the calculations, graphs and interpritations m.docx
Ok so I have done all the calculations, graphs and interpritations m.docxIlonaThornburg83
 
Ok so I know this is extreme short notice but I have a final 6 page .docx
Ok so I know this is extreme short notice but I have a final 6 page .docxOk so I know this is extreme short notice but I have a final 6 page .docx
Ok so I know this is extreme short notice but I have a final 6 page .docxIlonaThornburg83
 
Offenses and Punishment. Please respond to the following Explai.docx
Offenses and Punishment. Please respond to the following Explai.docxOffenses and Punishment. Please respond to the following Explai.docx
Offenses and Punishment. Please respond to the following Explai.docxIlonaThornburg83
 
Omit all general journal entry explanations.Be sure to include c.docx
Omit all general journal entry explanations.Be sure to include c.docxOmit all general journal entry explanations.Be sure to include c.docx
Omit all general journal entry explanations.Be sure to include c.docxIlonaThornburg83
 
Offer an alternative explanation for how these patterns of criminal .docx
Offer an alternative explanation for how these patterns of criminal .docxOffer an alternative explanation for how these patterns of criminal .docx
Offer an alternative explanation for how these patterns of criminal .docxIlonaThornburg83
 
Often, as a business operates, the partners bring some of their pers.docx
Often, as a business operates, the partners bring some of their pers.docxOften, as a business operates, the partners bring some of their pers.docx
Often, as a business operates, the partners bring some of their pers.docxIlonaThornburg83
 
Of all the GNR technologies (genetic engineering, nanotechnology and.docx
Of all the GNR technologies (genetic engineering, nanotechnology and.docxOf all the GNR technologies (genetic engineering, nanotechnology and.docx
Of all the GNR technologies (genetic engineering, nanotechnology and.docxIlonaThornburg83
 
Of the five management functions, which do you expect will experienc.docx
Of the five management functions, which do you expect will experienc.docxOf the five management functions, which do you expect will experienc.docx
Of the five management functions, which do you expect will experienc.docxIlonaThornburg83
 
Of the numerous forms of communication technologies presented in thi.docx
Of the numerous forms of communication technologies presented in thi.docxOf the numerous forms of communication technologies presented in thi.docx
Of the numerous forms of communication technologies presented in thi.docxIlonaThornburg83
 
Oceanic currents are an important mechanism for the transfer of ener.docx
Oceanic currents are an important mechanism for the transfer of ener.docxOceanic currents are an important mechanism for the transfer of ener.docx
Oceanic currents are an important mechanism for the transfer of ener.docxIlonaThornburg83
 
Obtaining informed consent for services from clients is required of .docx
Obtaining informed consent for services from clients is required of .docxObtaining informed consent for services from clients is required of .docx
Obtaining informed consent for services from clients is required of .docxIlonaThornburg83
 
ObjectiveTo acquire a comprehensive understanding of the applicatio.docx
ObjectiveTo acquire a comprehensive understanding of the applicatio.docxObjectiveTo acquire a comprehensive understanding of the applicatio.docx
ObjectiveTo acquire a comprehensive understanding of the applicatio.docxIlonaThornburg83
 
Observe the block diagrams.  Place events in order of occurrence in .docx
Observe the block diagrams.  Place events in order of occurrence in .docxObserve the block diagrams.  Place events in order of occurrence in .docx
Observe the block diagrams.  Place events in order of occurrence in .docxIlonaThornburg83
 
On April 8, 2014 Microsoft ended support of its Windows XP operating.docx
On April 8, 2014 Microsoft ended support of its Windows XP operating.docxOn April 8, 2014 Microsoft ended support of its Windows XP operating.docx
On April 8, 2014 Microsoft ended support of its Windows XP operating.docxIlonaThornburg83
 
Objective the students will be able to recognize the various genr.docx
Objective the students will be able to recognize the various genr.docxObjective the students will be able to recognize the various genr.docx
Objective the students will be able to recognize the various genr.docxIlonaThornburg83
 
On a blog site, httpcuriosity.discovery.comquestionsuperhuman.docx
On a blog site, httpcuriosity.discovery.comquestionsuperhuman.docxOn a blog site, httpcuriosity.discovery.comquestionsuperhuman.docx
On a blog site, httpcuriosity.discovery.comquestionsuperhuman.docxIlonaThornburg83
 

More from IlonaThornburg83 (20)

Once you have identified two questions that interest you, conduct an.docx
Once you have identified two questions that interest you, conduct an.docxOnce you have identified two questions that interest you, conduct an.docx
Once you have identified two questions that interest you, conduct an.docx
 
On December 31, 2015, Ms. Levine CPA, your manager and the treasurer.docx
On December 31, 2015, Ms. Levine CPA, your manager and the treasurer.docxOn December 31, 2015, Ms. Levine CPA, your manager and the treasurer.docx
On December 31, 2015, Ms. Levine CPA, your manager and the treasurer.docx
 
On Dumpster Diving” by Lars Eighner (50 Essays, p. 139-15.docx
On Dumpster Diving” by Lars Eighner (50 Essays, p. 139-15.docxOn Dumpster Diving” by Lars Eighner (50 Essays, p. 139-15.docx
On Dumpster Diving” by Lars Eighner (50 Essays, p. 139-15.docx
 
Ok so I have done all the calculations, graphs and interpritations m.docx
Ok so I have done all the calculations, graphs and interpritations m.docxOk so I have done all the calculations, graphs and interpritations m.docx
Ok so I have done all the calculations, graphs and interpritations m.docx
 
Ok so I know this is extreme short notice but I have a final 6 page .docx
Ok so I know this is extreme short notice but I have a final 6 page .docxOk so I know this is extreme short notice but I have a final 6 page .docx
Ok so I know this is extreme short notice but I have a final 6 page .docx
 
Offenses and Punishment. Please respond to the following Explai.docx
Offenses and Punishment. Please respond to the following Explai.docxOffenses and Punishment. Please respond to the following Explai.docx
Offenses and Punishment. Please respond to the following Explai.docx
 
Omit all general journal entry explanations.Be sure to include c.docx
Omit all general journal entry explanations.Be sure to include c.docxOmit all general journal entry explanations.Be sure to include c.docx
Omit all general journal entry explanations.Be sure to include c.docx
 
Offer an alternative explanation for how these patterns of criminal .docx
Offer an alternative explanation for how these patterns of criminal .docxOffer an alternative explanation for how these patterns of criminal .docx
Offer an alternative explanation for how these patterns of criminal .docx
 
Often, as a business operates, the partners bring some of their pers.docx
Often, as a business operates, the partners bring some of their pers.docxOften, as a business operates, the partners bring some of their pers.docx
Often, as a business operates, the partners bring some of their pers.docx
 
Of all the GNR technologies (genetic engineering, nanotechnology and.docx
Of all the GNR technologies (genetic engineering, nanotechnology and.docxOf all the GNR technologies (genetic engineering, nanotechnology and.docx
Of all the GNR technologies (genetic engineering, nanotechnology and.docx
 
Of the five management functions, which do you expect will experienc.docx
Of the five management functions, which do you expect will experienc.docxOf the five management functions, which do you expect will experienc.docx
Of the five management functions, which do you expect will experienc.docx
 
Of the numerous forms of communication technologies presented in thi.docx
Of the numerous forms of communication technologies presented in thi.docxOf the numerous forms of communication technologies presented in thi.docx
Of the numerous forms of communication technologies presented in thi.docx
 
Oceanic currents are an important mechanism for the transfer of ener.docx
Oceanic currents are an important mechanism for the transfer of ener.docxOceanic currents are an important mechanism for the transfer of ener.docx
Oceanic currents are an important mechanism for the transfer of ener.docx
 
Obtaining informed consent for services from clients is required of .docx
Obtaining informed consent for services from clients is required of .docxObtaining informed consent for services from clients is required of .docx
Obtaining informed consent for services from clients is required of .docx
 
ObjectiveTo acquire a comprehensive understanding of the applicatio.docx
ObjectiveTo acquire a comprehensive understanding of the applicatio.docxObjectiveTo acquire a comprehensive understanding of the applicatio.docx
ObjectiveTo acquire a comprehensive understanding of the applicatio.docx
 
Observe the block diagrams.  Place events in order of occurrence in .docx
Observe the block diagrams.  Place events in order of occurrence in .docxObserve the block diagrams.  Place events in order of occurrence in .docx
Observe the block diagrams.  Place events in order of occurrence in .docx
 
on 1.docx
on 1.docxon 1.docx
on 1.docx
 
On April 8, 2014 Microsoft ended support of its Windows XP operating.docx
On April 8, 2014 Microsoft ended support of its Windows XP operating.docxOn April 8, 2014 Microsoft ended support of its Windows XP operating.docx
On April 8, 2014 Microsoft ended support of its Windows XP operating.docx
 
Objective the students will be able to recognize the various genr.docx
Objective the students will be able to recognize the various genr.docxObjective the students will be able to recognize the various genr.docx
Objective the students will be able to recognize the various genr.docx
 
On a blog site, httpcuriosity.discovery.comquestionsuperhuman.docx
On a blog site, httpcuriosity.discovery.comquestionsuperhuman.docxOn a blog site, httpcuriosity.discovery.comquestionsuperhuman.docx
On a blog site, httpcuriosity.discovery.comquestionsuperhuman.docx
 

MIS 329 Decision Support Systems Assignment Nationwide Insur

  • 1. MIS 329 Decision Support Systems Assignment Nationwide Insurance Used Bl to Enhance Customer Service Nationwide Mutual Insurance Company, headquaitered in Columbus, Ohio, is one of the largest insurance and financial services companies, with $23 billion in revenues and more than $160 billion in statutory assets. It offers a comprehensive range of products through its family of 100-plus companies with insurance products for auto, motorcycle, boat, life, homeown- ers, and farms. It also offers financial products and services including annuities, mongages, mutual funds, pensions, and investment management. Nationwide strives to achieve greater efficiency in all operations by managing its expenses along with its ability to grow its revenue. It recognizes the use of its su·ategic asset of information combined with analytics to outpace competitors in strategic and operational decision making even in complex and unpredictable environments. Historically, Nationwide's business units worked inde- pendently and with a lot of autonomy. This led to duplication of efforts, widely dissimilar data processing environments, and exu·eme data redundancy, resulting in higher expenses. The situation got complicated when Nationwide pursued any merg- ers or acquisitions. Nationwide, using enterprise data warehouse technology from Teradata, set out to create, from tl1e ground up, a single, authoritative environn1ent for clean, consistent, and complete data that can be effectively used for best-practice analytics to make su-ategic and tactical business decisions in the areas of customer growth, retention, product profitability, cost contain- ment, and productivity improvements. Nationwide u-ansfonned its siloed business units, which were supponed by stove-piped data environments, into integrated units by using cutting-edge
  • 2. analytics that work with clear, consolidated data from all of its business units. The Teradata data warehouse at Nationwide has grown from 400 gigabytes to more than 100 terabytes and supports 85 percent of Nationwide's business with more than 2,500 users. Integrated Customer Knowledge than 48 sources into a single customer data mart to deliver a holistic view of customers. This data mart was coupled with Teradata's customer relationship management application to create and manage effective customer marketing campaigns that use behavioral analysis of customer interactions to drive customer management actions (CMAs) for target segments. Nationwide added more sophisticated customer analytics that looked at customer portfolios and the effectiveness of various marketing campaigns. This data analysis helped Nationwide to initiate proactive customer communications around customer lifetime events like marriage, birth ofchild, or home purchase and had significant impact on improv- ing customer satisfaction. Also, by integrating customer contact history, product ownership, and payment informa- tion, Nationwide's behavioral analytics teams further created prioritized models that could identify which specific cus- tomer interaction was important for a customer at any given time. This resulted in one percentage point improvement in customer retention rates and significant improvement in customer enthusiasm scores. Nationwide also achieved 3 percent annual growth in incremental sales by using CKS. There are other uses of the customer database. In one of the initiatives, by integrating customer telephone data from multiple systems into CKS, the relationship managers at Nationwide tty to be proactives in contacting customers in advance of a possible weather catastrophe, such as a hur - ricane or flood, to provide the primary policyholder infor- mation and explain the claims processes. These and other analytic insights now drive Nationwide to provide extremely personal customer service. Financial Operations
  • 3. A sinillar performance payoff from integrated information was also noted in financial operations. Nationwide's decentralized management style resulted in a fragmented financial report- ing environment that included more than 14 general ledgers, 20 chans of accounts, 17 separate data repositories, 12 different repo1ting tools, and hundreds of thousands of spreadsheets. There was no common central view of the business, which Nationwide's Customer Knowledge Store (CKS) m1t1at1ve developed a customer-centric database that integrated customer, product, and externally acquired data from more resulted in labor-intensive slow and inaccurate reporting. About 75 percent of the effort was spent on acquiring, clean- ing, and consolidating and validating the data, and very little time was spent on meaningful analysis of the data. The Financial Performance Management initiative implemented a new operating approach that worked on a single data and technology architecture with a common set of systems standardizing the process of reporting. It enabled Nationwide to operate analytical centers of excellence with world-class planning, capital management, risk assessment, and other decision support capabilities that delivered timely, accurate, and efficient accounting, reporting, and analytical services. The data from more than 200 operational systems was sent to the enterprise-wide data warehouse and then distrib- uted to various applications and analytics. This resulted in a 50 percent improvement in the monthly closing process with closing intervals reduced from 14 days to 7 days. Postmerger Data Integration Nationwide's Goal State Rate Management m1t1at1ve enabled the company to merge Allied Insurance's automobile policy system into its existing system. Both ationwide and Allied source systems were custom-built applications that did not share any common values or process data in the same manner. Nationwide's IT department decided to bring all the data from source systems into a centralized data warehouse, organized in an integrated fashion that resulted
  • 4. in standard dimensional reporting and helped Nationwide in performing what-if analyses. The data analysis team could identify previously unknown potential differences in the data environment where premiums rates were cal- culated differently between Nationwide and Allied sides. Correcting all of these benefited Nationwide's policyhold- ers because they were safeguarded from experiencing wide premium rate swings. Enhanced Reporting Nationwide's legacy reporting system, which catered to the needs of property and casualty business units, took weeks to compile and deliver the needed reports to the agents. Nationwide determined that it needed better access to sales and policy information to reach its sales targets. It chose a single data warehouse approach and, after careful assessment of the needs of sales management and individual agents, selected a business intelligence platform that would integrate dynamic enterprise dashboards into its reporting systems, making it easy for the agents and associates to view policy information at a glance. The new reporting system, dubbed Revenue Connection, also enabled users to analyze the infor- mation with a lot of interactive and drill-down-to-details capa- bilities at various levels that eliminated the need to generate custom ad hoc reports. Revenue Connection virtually elimi- nated requests for manual policy audits, resulting in huge savings in time and money for the business and technology teams. The reports were produced in 4 to 45 seconds, rather than days or weeks, and productivity in some units improved by 20 to 30 percent. Answer the following questions: 1. Why did Nationwide need an enterprise-wide data warehouse? 2. How did integrated data drive the business value? 3. What forms of analytics are employed at Nationwide? 4. With integrated data available in an enterprise data warehouse, what other applications could Nationwide potentially develop? Assignment Purpose:
  • 5. The main purpose of the assignment is to enable students describe the relationship between DW, BI, and DSS. Assignment Guideline: 1. Use Time New Roman. 2. Use Font Size 12. 3. Use 1.15 Line Spacing. 4. Individual Assignment. Substantive Ethics I found this article by Mark S. Blodgett to be quite refreshing and informative in terms of the new perspectives being presented. In the article, the issue being presented is the differences between ethics and law within corporate programs. It is an interesting issue that not many seem to think about when mentioning business rules and regulations. Moreover, ethics and law are typically viewed as two completely separate things, but the author digresses. Blodgett believes that in order to better integrate ethical codes and legal terms into a corporation, both entities should be viewed as one and the same. This is a fair point because as mentioned in the article, ethical codes are used by more than 90% of companies today, yet law has not really sunken into businesses as much as it should. Also, as mentioned in the text, legal obligations can be easily ignored by business executives simply because they are ignorant of the laws that are proposed. This is another huge factor as to why laws and ethics should be two sides of the same coin and not be viewed as differences. It must be mentioned that I do agree with the author and what the articles findings suggeste d. Both legal and ethical approaches should be taken when considering corporations and businesses in order to integrate a more fluent and
  • 6. accommodable environment. Additionally, I can imagine this study was a long and difficult one as over twenty different compliance areas were assessed in order to compile an accurate study. Not only that, the term frequencies needed to be operationally defined correctly which is no easy feat. Overall, I feel this study is a very helpful and useful one not only for corporate business, but for anyone in the workplace. Legal obligations must be enforced but at the same time ethical codes must be placed so that businesses may prosper in a healthy way. Justice at the Millennium This article by Colquitt et al. was very interesting and insightful on the topic of justice and fairness. Before reading this study, I did not even consider what defines justice or how fairness is accounted for. The authors are correct when stating that we only judge something as just based on past research and experiences and I found this quite interesting. Furthermore, the authors found research studies dating back to 1975 up to the date of publication in order to see just how much things have changed in terms of the workplace. When considering this, it was a great choice to conduct this study as a meta-analysis to see the key differences between older definitions of justice, and a modern take on the concept. Between all this time, lots of rules and regulations have been implemented into what defines justice and more specifically, into the workplace. This study mainly focuses on how justice today plays a role in an organizational point of view rather than a courtroom, which can relate to a lot more people. Rightfully so, the researchers proposed three important questions to take into consideration when analyzing all the different types of articles over the years. Personally, I believe this meta-analysis is very important for anyone in the workplace because the questions posed by the researchers are prominent issues in today’s society. For instance, an employee may have more than one boss and those bosses may define fairness in differing ways. A study like this may help both bosses come to a happy medium and decide on
  • 7. whatever the employee has done as fair or not. Even more so, thousands of new individuals are entering the workforce every month and with increasing demand for jobs comes new accommodations for what defines as just. Again, I cannot stress enough how important this study is to those already in the workplace or to those who are looking to make a change into any work environment. Journal of Intelligent & Fuzzy Systems 38 (2020) 6159–6173 6159 DOI:10.3233/JIFS-179698 IOS Press The impact of big data market segmentation using data mining and clustering techniques Fahed Yosepha,b,∗ , Nurul Hashimah Ahamed Hassain Malimb, Markku Heikkiläc, Adrian Brezulianud, Oana Gemane and Nur Aqilah Paskhal Rostamb a Faculty of Social Sciences, Business and Economics, Åbo Akademi University, Turku, Finland bDepartment of School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia cFaculty of Social Sciences, Business and Economics, Åbo Akademi University, Turku, Finland dFaculty of Electronics, Telecommunications and Information Technology, Gheorghe Asachi Technical University, Iaşi, Romania eDepartment of Health and Human Development, Stefan cel Mare University, Suceava, Romania Abstract. Targeted marketing strategy is a prominent topic that has received substantial attention from both industries and
  • 8. academia. Market segmentation is a widely used approach in investigating the heterogeneity of customer buying behavior and profitability. It is important to note that conventional market segmentation models in the retail industry are predominantly descriptive methods, lack sufficient market insights, and often fail to identify sufficiently small segments. This study also takes advantage of the dynamics involved in the Hadoop distributed file system for its ability to process vast dataset. Three different market segmentation experiments using modified best fit regression, i.e., Expectation-Maximization (EM) and K- Means++ clustering algorithms were conducted and subsequently assessed using cluster quality assessment. The results of this research are twofold: i) The insight on customer purchase behavior revealed for each Customer Lifetime Value (CLTV) segment; ii) performance of the clustering algorithm for producing accurate market segments. The analysis indicated that the average lifetime of the customer was only two years, and the churn rate was 52%. Consequently, a marketing strategy was devised based on these results and implemented on the departmental store sales. It was revealed in the marketing record that the sales growth rate up increased from 5% to 9%. Keywords: Market segmentation, data mining, customer lifetime value (CLTV), RFM model (recency frequency monetary) 1. Introduction is the key success to brand loyalty, repeat store visits, and ultimately, sales conversions. This relationship The retail industry collects enormous volumes of has been affected by recent economic and social. The
  • 9. POS data. However, this RAW POS data has min- retail industry is prompted to be more strategic in imal use if it’s not properly processed to generate their planning and to develop a deep understanding retail insights, optimize marketing efforts and drive of its consumers as well as their competitors. Under- decisions. The retailer’s relationship with customers standing customers’ behavior as well as establishing a loyal relationship with customers has become the cen- ∗ Corresponding author. Fahed Yoseph, Faculty of Social Sci - tral concern and strategic goal for most retailers [1] °ences, Business and Economics, Abo Akademi University, Turku, interested in tracking and managing their customer Finland, and Deparment f School of Computer Sciences, Uni - lifetime value on a systematic basis [44]. Market seg-versiti Sains Malaysia, 11800, Penang, Malaysia. E-mail: [email protected] mentation is the process to divide the market base ISSN 1064-1246/20/$35.00 © 2020 – IOS Press and the authors. All rights reserved mailto:[email protected] https://1064-1246/20/$35.00 6160 F. Yoseph et al. / The impact of big data market segmentation using data mining and clustering techniques of potential customers into similar or homogeneous groups or segments that possess mutual characteris-
  • 10. tics helps marketers to gather individuals with similar choices and interests [2]. This enables retailers to avoid selling unprofitable and irrelevant products with regards to their marketing purpose, which will result in better management of the available resources through the selection of suitable market segment and the primary focus of specific promising segments [3–15]. Furthermore, as far as the research scope is concerned, there has been number of studies that examine customer purchase behavior and lifetime value among different products based on a variety of market segmentation with demographic variables and characteristics. Instead of addressing individual consumers based on their purchasing behavior, most market segmentation studies merely considered the overview of consumers’ historical data to produce assumptions of what makes consumers similar to one another. It is significant to highlight that this method hides critical facts about individual consumers. Among those customer lifetime value models, a highly regarded model cited by many experts is the Pareto/NBD Counting Your Customers proposed by Schmittlein, Morrison, and Colombo (1987). The model investigates customer purchase behavior in settings where customer purchase dropout is unob- served. However, the model is powerful for analyzing customer purchase behavior, but it has been proven to be empirically complex to implement due to the computational challenges, and only a handful of researches claim to have implemented it [44]. Based on previous studies of market segmenta- tion on the retail domain, Recency, Frequency, and
  • 11. Monetary (RFM) has been extensively employed as this model can divide customers into groups which, therefore, enables retailers to decide on ways to fully utilize their limited resources in providing effective customer service through the categorization of cus- tomers. Nonetheless, RFM also has its own limitation [4] where it only focuses on customers’ best scores in addition to providing less meaningful scoring on recency, frequency and monetary for most consumers (Wei, Lin, and Wu, 2010). Moreover, RFM analy- sis is not able to prospect for new customers, as it mainly concerns the organization’s current customers [6] and that it is not considered as a precise quan- titative analysis model as the importance of each RFM measure is different among other industries [16–20]. The current research foresees an enhanced user-friendly market segmentation modeling method, which is more advanced and effective than conven- tional RFM method. The integration of Customer Lifetime Value and newly proposed RFM variants (PQ) (T) into a closed-loop model represents dif- ferent variation in customer purchase behavior. The enhanced model has the capability to simultane- ously analyze millions of raw POS data, identify groups of customers by criteria the retailer may never have considered. This goldmine knowledge is expected to help marketers avoid the assumptions when doing customer deep-dive and trend analysis, which subsequently tapped marketers to device tar- geted marketing campaign resulting in sales growth and higher ROI. The RFMPQ and RFMT dataset con- centrate on the idea of identifying the purchasing power history of an individual customer or segment. P variable represents the average purchasing power per customer per all transactions, Q variable repre-
  • 12. sents the average purchasing power per product, and T represents the change of consumer buying behav- ior or trend using change rate. The enhanced RFM model also incorporates CLTV for predicting future cash flow attributed to the customer’s shopping period with the retailer [8], followed by applying a mod- ified best-fit regression technique, and K-means++ and Expectation-Maximization (EM) clustering algo- rithms to analyze the customer buying behavior as well as to assess the clustering technique’s perfor- mance using cluster quality assessment. The analysis can also identify marketers’ area of focus and ensure the highest quality of customer service. 2. Installing and using the microsoft word template Market segmentation is the process of categorizing large homogenous market into similar or homoge- neous smaller groups who share characteristics such as income, shopping habits, lifestyle, age, and per- sonality traits [9]. These segments are relevant to marketing and sales and can be used to optimize products, customer service and advertising to differ- ent consumers [6]; It is seen that many companies across the retail industry have identified customer service as a market key differentiator and tend to segment their customers for positive customer expe- rience and service delivery [10]. There are three types of market segmentation bases, namely demo- graphic, geographic, and behavioral. Demographic segmentation is the most commonly used variable when segmenting a market. It has the ability to
  • 13. F. Yoseph et al. / The impact of big data market segmentation using data mining and clustering techniques 6161 provide the retailer with a clear vision with the future advertising plans, a precise customer shop- ping profile and focuses on the measurable factors of consumers and their households. Furthermore, this segment is primarily descriptive in terms of gender, race, age, income, lifestyle, and family status [11]. In contrast, behavioral segmentation divides cus- tomers based on their attitude toward products. Many marketers consider behavioral variables such as occa- sions, benefits, usage rate, customer status, readiness, loyalty status, and attitude towards a product as the ideal starting points for creating market segmentation [11]. According to behavioral segmentation, con- sumer behavior is the segmentation process based on their evaluation and buying activities, as well as the use and disposal of goods to recognize consumer needs. These criteria can provide a thorough under- standing of consumer behavior as they reason from social psychology, anthropology, economics, sociol- ogy, and psychology that influence consumers on their purchasing decision of products [21–25]. To get a sense of the overall customer lifetime value for the customer-base, [45] proposed a framework to integrate customers’ distribution with the iso-value curves, by grouping customers on the basis of RFM characteristics and to understand the factors that trig- ger consumer’s defection. [48] proposed analytical model for consumer engagement, related to the subse- quent stages of the consumer life-cycle like customer development, customer acquisition, and customer retention. The authors concluded that the availabil- ity of data is vital to the development of advanced analysis in each consumer’s stage. However, sev-
  • 14. eral organizational issues of analytics for consumer engagement remain, which constitute barriers to implementing analytics for customer engagement. In order to solve the problems of consumer behav- ior that evolved with time, this research examines the behavioral, demographic segmentation model and identifies customer behavior using model Customer Life Time Value (CLTV) and Recency, Frequency and Monetary (RFM) model. 2.1. Customer life time value (CLTV) model Customer Lifetime Value (CLTV) is an important metric to measure the total worth or profit to a busi- ness obtained from a customer over the whole period of their relationship with the retailer [8]. The liter- ature defines the customers churn as the extinction of the contract between the firm and the customer, where customer retention refers to the collection of activities organizations take to reduce the number of customer’s defections. Churn rate and retention rate critical matrix for any company and considered primary components of the future CLTV. Where CLTV is an estimation of the average profit, a customer is expected to gen- erate before he or she churn [48]. The concept of retention and churn is often correlated with industry life-cycle. When the industry is in the growth phase of its life-cycle, sales increase exponentially. How- ever, customer churn is the most challenging task for the retailer industry. In this perspective, more insight is needed to know the reason for customer churn in a dynamic industry.
  • 15. The three main components of CLTV are customer acquisition, customer expansion, and customer reten- tion [46]. Nevertheless, it is crucial to consider COGS (Cost of Goods Sold) and acquisition cost to square off the real CLTV. The basic model to calculate CLTV is presented in Equation (1). �n ptCLTV = (r)t (1) t=1 (1 + d)t The above CLTV formula is more of a proxy for an average customer who stays for X period of time and pays Y total amount of money. The t represents a specific period of time, while (t = 1) represents the first year, and (t = 2) denotes the second year. The n represents the total time period the customer will stay with the retailer before churn occurs. The r represents the month over retention rate. Pt is the profit that the customer/customers will contribute or generate to the Retailer in the Period t, and finally, d refers to the churn rate. Additionally, the customer’s loyalty can be calculated using the Retention Rate formula, as illustrated in Equation (2). Based on the Retention Rate formula, CE denotes to the number of customers at the end of each time period, where, CN is the total number of new customers acquired in the chosen time period, and CS denotes to the number of customers at the start of the time period. � � (CE - CN) Retention rate = × 100 (2) CS Management of consumer retention requires the
Management of consumer retention requires tools that allow decision-makers to assess each consumer's risk of defecting and to understand the factors that trigger defection [47]. A customer retention (or loyalty) strategy is the collection of activities a retailer uses to maintain a long-term relationship with existing customers and to increase profitability by increasing the number of repeat customers. CLTV represents a clear improvement over traditional RFM analysis because both the frequency of a customer's purchases and the customer's average purchase amount are used to segment the customer base. The CLTV matrix classifies customer purchase behavior into the segments Best, Spender, Frequent, and Uncertain, as defined by Marcus (1998); Table 1 lists the criteria of each segment.

Table 1
Criteria of customers in each segment

Segment     Criteria of customers
Best        Average purchase amount > total average purchase amount; average purchase frequency > total average frequency
Spender     Average purchase amount > total average purchase amount; average purchase frequency < total average frequency
Frequent    Average purchase amount < total average purchase amount; average purchase frequency > total average frequency
Uncertain   Average purchase amount < total average purchase amount; average purchase frequency < total average frequency
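A minimal sketch of the Table 1 classification is given below; it assumes a pandas DataFrame with illustrative column names (avg_amount, avg_frequency) rather than the study's actual schema.

import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "avg_amount": [320.0, 410.0, 55.0, 40.0],
    "avg_frequency": [12, 3, 15, 2],
})

# Thresholds are the overall averages, as in Table 1.
amount_mean = customers["avg_amount"].mean()
frequency_mean = customers["avg_frequency"].mean()

def segment(row):
    high_amount = row["avg_amount"] > amount_mean
    high_frequency = row["avg_frequency"] > frequency_mean
    if high_amount and high_frequency:
        return "Best"
    if high_amount:
        return "Spender"
    if high_frequency:
        return "Frequent"
    return "Uncertain"

customers["segment"] = customers.apply(segment, axis=1)
print(customers)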
2.2. Recency, frequency, and monetary (RFM) model

RFM is a standard statistical marketing model for segmenting customer behavior and assessing customer lifetime value. The model is very popular in the retail industry because it groups customers based on their purchase history: how recently, how often, and how much the customer bought. The RFM model helps retailers group customers into segments or categories so they can identify the customers most likely to respond to marketing promotions and future personalization services [17]. R (recency) refers to the time elapsed since the customer's last purchase, F (frequency) to how often the customer bought within a time period, and M (monetary) to the amount of money spent in that period [18].

Quintile scoring is the most commonly used scoring scheme in the RFM method: customers are arranged in ascending or descending (best-to-worst) order on each attribute and divided into five equal groups, with the best group receiving the highest score (5) and the worst group the lowest score (1) [1]. The RFM score is the weighted combination of the individual component scores, calculated as in Equation (3), and can finally be rescaled to the 0-1 range using Equation (4) [17]:

RFM score = (recency score × recency weight) + (frequency score × frequency weight) + (monetary score × monetary weight)     (3)

Rescaled RFM score = (RFM score − minimum RFM score) / (maximum RFM score − minimum RFM score)     (4)
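The sketch below illustrates quintile scoring and Equations (3) and (4) with pandas; the raw column names, the sample values, and the use of reversed labels for recency (smaller recency is better) are assumptions for the illustration, and the equal weights mirror the setting used later in this study.

import pandas as pd

rfm = pd.DataFrame({
    "recency": [5, 40, 90, 200, 365, 15, 60, 120, 10, 300],
    "frequency": [20, 5, 2, 1, 1, 15, 7, 3, 25, 2],
    "monetary": [900, 250, 80, 30, 20, 700, 300, 120, 1100, 60],
})

# Quintile scores 1..5; ranks break ties so qcut always gets unique edges.
rfm["R"] = pd.qcut(rfm["recency"], 5, labels=[5, 4, 3, 2, 1]).astype(int)
rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)
rfm["M"] = pd.qcut(rfm["monetary"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)

weights = {"R": 1.0, "F": 1.0, "M": 1.0}   # equal weights

# Equation (3): weighted combination of the component scores.
rfm["rfm_score"] = (rfm["R"] * weights["R"]
                    + rfm["F"] * weights["F"]
                    + rfm["M"] * weights["M"])

# Equation (4): rescale to the 0-1 range.
lo, hi = rfm["rfm_score"].min(), rfm["rfm_score"].max()
rfm["rfm_score_scaled"] = (rfm["rfm_score"] - lo) / (hi - lo)
print(rfm[["R", "F", "M", "rfm_score", "rfm_score_scaled"]])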
2.3. Market segmentation using data mining, RFM, CLTV models and clustering techniques

Market segmentation helps to differentiate and customize marketing strategies for individual segments. Segmentation is a key task in data mining, where data mining is used to interrogate segmentation data and create data-driven behavioral segments that reveal meaningful patterns and rules underlying consumer behavior [19]. [26] and [27] were among the studies that combined data mining, RFM, CLTV, and clustering techniques to build decision-making systems. [28] proposed clustering and profiling of customers using customer relationship management (CRM) data and RFM to generate recommendations; data mining was applied to historical customer sales data using the RFM model with the K-Means algorithm, and the results yielded recommendations for customer relationship strategy. Another study (2016) proposed a three-dimensional market segmentation model based on customer lifetime value, customer activity, and customer satisfaction; to improve accuracy, the author grouped customers into several groups and obtained the corresponding variables from the RFM, Kano, and BG/NBD models.

A market segmentation model also helps enterprises maximize their profits. In [29-35], customers were classified into clusters using the RFM technique, and association rules were mined to identify high-profit customers. RFM statistical techniques and clustering methods for customer value analysis were combined in [26] for a company's online sales. As far as methodology is concerned, there is no standard convention for measuring customer purchase behavior; each study examines it differently. Nevertheless, K-Means is noted as the most widely used clustering algorithm in previous research because of its simplicity and speed on large amounts of data [36]. Despite this strength, K-Means seeds its initial cluster centers at random and therefore does not return the same result each time it is run [37]. An improved algorithm, K-means++, uses a more sophisticated seeding procedure for the initial choice of center positions and is often twice as fast as standard K-Means. Neha, Kirti, and Kanika (2012) noted that K-Means and Expectation-Maximization (EM) are the two algorithms most commonly used to identify growth patterns. Although EM resembles K-Means, it iterates two distinct steps until the current hypothesis no longer changes [29]: the Expectation (E) step computes the probability that each data point belongs to each class, and the Maximization (M) step adjusts the parameters of each class to maximize those probabilities. The steps eventually converge, although not necessarily to the correct solution. A notable feature of EM is that it can be applied to problems in which the observed data provide only "partial" information [30]. Several comparative studies of EM and K-Means [31-34] observed that EM outperformed K-Means and that results improved when the two were hybridized.

The current study integrates two dynamic models, CLTV and RFM, and adds new RFM variants, namely P, Q, and T, to address the weaknesses and inaccuracies in consumer modeling caused by the limitations of RFM. In addition, the study applies the K-means++ and Expectation-Maximization (EM) clustering algorithms to offer the retail industry an effective analysis of customer buying behavior that combines customer profitability and product profitability in creating strategic marketing campaigns, as explained in the introduction [38-40].
2.4. Mining big data

Data mining is the process of extracting information from large data sets and transforming it into an understandable form for further use; it is applied when a database is large and classifying its data is difficult [35]. The term Big Data is used for very large databases, from terabytes up to many petabytes in size, that are beyond the ability of a conventional relational database management system (RDBMS) to process within a tolerable elapsed time. Patel, Birla and Nair (2012) carried out extensive experiments on the Big Data problem; the outcome was the use of the Hadoop Distributed File System (HDFS) for storage and the MapReduce method for parallel processing of large data volumes. Research on Big Data analysis with data mining, especially with clustering methods, is still considered young and therefore attracts many researchers to this promising area. [37, 38] proposed a fast parallel K-Means clustering algorithm based on MapReduce, which has been widely embraced by both academia and industry; they used speedup, scale-up, and size-up to evaluate the performance of their proposed algorithm, and their findings showed that the proposed model could process very large datasets effectively on commodity (low-cost) hardware.

Hadoop is becoming a commodity for every data-driven organization. Where data is large and arrives in many formats, mining and extracting intelligence from it has always been a challenge [39]. These new dynamics bring new challenges to current analytical models and traditional databases and emphasize the need for a paradigm shift in data extraction and analysis. Such challenges include the performance of data retrieval and the variety of data sources, for which the relational database format may no longer be the best option. [39] stated that traditional database systems fall short in scaling to boost performance and in dealing with Big Data effectively, and thus the adoption of systems such as Hadoop is increasing. Hadoop is an open-source framework for data-intensive distributed processing of large-scale data, based on the MapReduce programming model and a distributed file system called the Hadoop Distributed File System (HDFS). MapReduce is a methodology for processing and generating large datasets, which makes Hadoop the preferred solution to the problems of traditional data mining [41]. The main components of Hadoop are HDFS and MapReduce. HDFS is a high-bandwidth clustered storage layer that allows applications to rapidly process massive data in parallel, which is vital for large files. MapReduce, the heart of Hadoop, divides the input dataset into independent smaller pieces that are distributed among multiple machines, referred to as nodes, and processed in parallel [42].
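As a concrete illustration of the MapReduce idea described above, the sketch below simulates the map, shuffle/sort, and reduce phases locally in Python: the mapper emits (customer id, sale amount) pairs from POS lines and the reducer sums them per customer. The comma-separated line layout is assumed for the example; this is not a real Hadoop job, only the programming pattern.

from itertools import groupby

def mapper(lines):
    """Map phase: emit (customer_id, grand_total) for every POS line."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        customer_id, grand_total = fields[0], fields[-1]   # assumed layout
        yield customer_id, float(grand_total)

def reducer(pairs):
    """Reduce phase: Hadoop sorts mapper output by key before reducing,
    so equal keys arrive grouped together; here sorted() plays that role."""
    for customer_id, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield customer_id, sum(amount for _, amount in group)

if __name__ == "__main__":
    demo = ["C001,2014-06-01,120.50", "C002,2014-06-01,40.00", "C001,2014-06-02,75.25"]
    for cid, total in reducer(mapper(demo)):
        print(f"{cid}\t{total:.2f}")

In an actual deployment each mapper and reducer would run on a different node against an HDFS block, which is what gives MapReduce its scalability.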
3. Research method

The methodology of this work involved five phases, outlined in Fig. 1. The first phase focused on implementing our POS database on a single-node Hadoop distributed file system. The experiments were performed on the following system. Single node using Java: Intel Core i5, 8 GB RAM, 2.4 GHz CPU; Java with JDK 1.8. Hadoop implementation: Ubuntu 16.10, Java with JDK 1.8, Hadoop 2.7.0. The second phase covered Extract, Transform, and Load (ETL), involving the data preprocessing steps of data cleansing, feature selection, and data transformation. Since the dataset used in the current research differed from those in the existing literature, a controlled experiment was performed in which the work of [20] was replicated as the baseline of this research. The hybrid approach (RFMPQ and CLTV) made up the third and fourth phases, where different methods were employed stepwise. Because the data was transformed into three different variants, i.e., RFM, PQ, and T, the first processing step differed for one of them: classification was used to categorize customer purchase behavior into the CLTV matrix based on the RFM and PQ datasets, while a modified best-fit regression was performed on the T dataset to find the customer purchase trend (curve). Although [20] employed only the K-Means technique, the present experiment was extended to include K-means++ and EM. The outputs were then fed into the clustering algorithms, i.e., K-means++ and EM, in the fourth phase for further demographic segmentation. The accuracy of these clustering algorithms was measured in the final phase using the cluster quality assessment introduced by Draghici and Kuklin (2003). Additionally, the retention rate was calculated, and human judgment was included as a measure of the effectiveness of the method for a marketing campaign. The methodology, a hybrid of classification and clustering for market segmentation, is illustrated in Fig. 1.

Fig. 1. Methodology for the hybrid of classification and clustering of market segmentation.

3.1. Proposed data transformation using RFM model, RFMPQ, RFMT

In this study we used Apache Hadoop on a single-node Hadoop cluster; Ubuntu Linux 12.04 64-bit Server Edition was the operating system, and KVM (Kernel-based Virtual Machine) was selected as the virtualization environment. The Hadoop (HDFS) node was accessed via Secure Shell (SSH). No parameter or optimization adjustments were made to the operating system to improve performance; this type of Hadoop installation serves the purpose and is sufficient to provide a running Hadoop environment in which to con-
duct our experiment. The market segmentation model uses retail POS data acquired from a medium-sized retailer in the State of Kuwait. The POS data contains three years (2012-2015) of customers' initial and repeat purchases made at different geographical branches. Each transaction represents a product purchased, and each line consists of a cashier number, store code, item code, brand code, quantity sold, product price, date and time of the transaction, sub-total, and grand total, as well as the customer's demographic information. Since the data was held in separate compilations based on the stores' geographic locations, a common data format with consistent definitions and more descriptive keys and fields was developed to merge the information. In this phase, string variables were converted to numeric variables, and missing values were then checked and replaced manually with default or mean values. In this research, the PQ variants are used to describe a customer's purchase power in different demographic and behavioral eras and the customer's attraction to a specific product or service. For the T variant, customers from the Best segment are used to identify the customer purchase curve, and we calibrate the market segmentation model using repeated transactions for 3220 customers over a two-year period.

The age attribute was grouped into six categories, on the premise that a typical customer's needs change as they age, and each category was identified by a unique number: Category 1 = (1-17), Category 2 = (18-24), Category 3 = (25-34), Category 4 = (35-44), Category 5 = (45-54), and the senior Category 6 = (55+). The gender attribute was encoded as 1 for male customers, 2 for female customers, and 3 for companies. Furthermore, using the concept-hierarchy method, lower-level demographic attributes such as city and country were replaced by the higher-level concept of nationality. Citizens of Middle Eastern nationalities, Asian nationalities, the USA, and Canada were each assigned a unique numeric (binary) value; other nationalities were grouped by continent, namely Europeans and Africans, with one exception made for British nationals due to their high volume of purchases.
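A minimal sketch of the demographic encoding just described is given below; the DataFrame and column names are illustrative placeholders, not the study's actual schema.

import pandas as pd

def age_category(age):
    """Map an age to the six categories used in this study."""
    bounds = [(17, 1), (24, 2), (34, 3), (44, 4), (54, 5)]
    for upper, category in bounds:
        if age <= upper:
            return category
    return 6   # 55 and above

gender_codes = {"male": 1, "female": 2, "company": 3}

customers = pd.DataFrame({
    "age": [16, 29, 41, 58],
    "gender": ["female", "male", "female", "company"],
})
customers["age_category"] = customers["age"].apply(age_category)
customers["gender_code"] = customers["gender"].map(gender_codes)
print(customers)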
To ensure maximum accuracy of the RFM scores, the values of five attribute dimensions from the POS data were required. These attributes are described in Table 2.

Table 2
Attributes of RFMPQ

CUSTOMER ID       Customer unique identifier, used to capture customer-related information.
TRX DATE          Transaction date, used to capture the customer's Recency (R).
TXH COUNT         Number of transactions, used to capture the Frequency (F) of each customer's purchases.
TRX TOTAL SALE    Total amount of each transaction, used to capture the Monetary (M) value generated by the customer.
TRX UNIT PRICE    Average purchase power (monetary), used to capture the Average Monetary (P) per customer.
TRX QTY           Average purchase power (quantity), used to capture the Average Items (Q) purchased per customer.

The next step involved calculating the RFM scores as well as the newly proposed PQ and T variants. The implementations of CLTV, retention rate, RFMPQ, and T were developed in the PL/SQL programming language. Note that the RFMPQ score is the weighted average of its individual components, and the scoring analysis typically involves grouping customers into equal-sized buckets (quantiles). In this study, the grouping procedure was applied independently to each of the five RFMPQ component measures: customers were grouped on each measure into classes of equal size, and the derived R, F, M, P, and Q groups became the components of the RFMPQ cell assignment. The RFMPQ groups were then aggregated with appropriate client-defined weights, and the scores were computed as the weighted average of the components.
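The sketch below shows one way to derive the Table 2 attributes (R, F, M, P, Q) from raw POS transactions with pandas; the column names and snapshot-date convention are assumptions for illustration and do not reproduce the authors' PL/SQL implementation.

import pandas as pd

pos = pd.DataFrame({
    "customer_id": ["C1", "C1", "C2", "C2", "C2"],
    "trx_date": pd.to_datetime(["2014-01-05", "2014-03-20", "2014-02-10",
                                "2014-02-28", "2014-04-01"]),
    "trx_total_sale": [120.0, 80.0, 40.0, 55.0, 30.0],
    "trx_unit_price": [12.0, 8.0, 4.0, 5.5, 3.0],
    "trx_qty": [10, 10, 10, 10, 10],
})

# Recency is measured against a snapshot date one day after the last sale.
snapshot = pos["trx_date"].max() + pd.Timedelta(days=1)
rfmpq = pos.groupby("customer_id").agg(
    recency=("trx_date", lambda d: (snapshot - d.max()).days),   # R
    frequency=("trx_date", "count"),                             # F
    monetary=("trx_total_sale", "sum"),                          # M
    avg_monetary=("trx_unit_price", "mean"),                     # P
    avg_quantity=("trx_qty", "mean"),                            # Q
)
print(rfmpq)

Each of these five columns would then be grouped into equal-sized classes and combined with the client-defined weights, exactly as described for the RFMPQ cell assignment above.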
The values and scores of the RFMPQ variables were then determined and used as inputs to the clustering algorithms. In accordance with this research and the client's requests, the RFMPQ variables were given equal weights (1:1:1).

The second variant proposed in this research is the T variable, which represents the trend of customer purchase behavior as a rate of change. This study proposes a combination of two analytical data mining steps: to find T, the change in a consumer's purchase trend, an enhanced best-fit regression algorithm is used, and the T dataset is then fed into the unsupervised clustering algorithms to split consumers into groups based on pattern dissimilarities. The T variable should answer a very important question: whether the consumer is at high risk of shifting his or her business to another retailer. One of the most common indicators of a high-risk consumer is a drop in purchase power together with a decrease in visits. A major limitation of market segmentation and RFM models is that they ignore the behavioral changes of consumers during the analysis period; although the recency variable acts as one indicator of such behavior, it is affected by customers' transient behavior and is based only on the last purchase date. Introducing a new analysis variable was therefore essential for the retailer to narrow down high-risk consumers.
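To make the T idea concrete, the sketch below fits an ordinary least-squares line to each customer's per-period spend and treats a negative slope as a sign of a customer at risk of defecting. This is only an illustration of the trend concept under that assumption, not the authors' modified best-fit regression algorithm.

import numpy as np

def purchase_trend(period_spend):
    """Return the slope of spend over consecutive, equally spaced periods."""
    periods = np.arange(len(period_spend))
    slope, _intercept = np.polyfit(periods, period_spend, deg=1)
    return slope

declining = purchase_trend([300.0, 260.0, 210.0, 150.0])   # negative slope: at risk
growing = purchase_trend([90.0, 120.0, 180.0, 260.0])      # positive slope: profitable
print(f"declining customer slope: {declining:.1f}")
print(f"growing customer slope:   {growing:.1f}")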
From the retailer's perspective, average values change according to customers' purchases and their satisfaction with the retailer's prices and services. If a customer's average purchase amounts decrease continuously, it can be concluded that the customer is, or was, on the verge of shifting his business to another retailer. A similar conclusion can be drawn for a customer whose average purchase value increases, showing that the customer is very profitable and should be ranked accordingly; customers with such variation must be treated accordingly. A computable parameter was developed through some advanced programming; first, the purchase amounts of all customers in each selected analysis period were required.

3.2. Market segmentation using K-means++ and EM clustering methods

The drawback of the classic K-Means algorithm is that the user must define the initial centroid points, and the algorithm offers no accuracy guarantees. This becomes more critical when clustering documents, because each center point is represented by a word and the distance calculation between words is not a trivial task. To overcome this problem, K-means++ was introduced to find good initial center points. K-means++ is a simple probabilistic means of initializing K-Means clustering that not only has the best known theoretical guarantees on expected outcome quality but also works very well in practice. According to [43-51], the K-means++ algorithm is a variation of K-Means in which initial cluster centers are selected from random starting centers with specific probabilities; the essential requirement is to preserve the diversity of the seeds while remaining robust to outliers. The primary concern of the K-Means problem is to identify cluster centers that minimize intra-class variance by reducing the distance from each clustered data point, which can be achieved through an effective, well-designed cluster-initialization technique. K-means++ was proposed in 2007 by [42] for choosing initial values (seeds) for the classic K-Means clustering algorithm in order to avoid the poor results that K-Means can otherwise produce. K-means++ initializes the means more intelligently, so that the distribution of cluster means is roughly even relative to the data: it selects the first cluster center at random and then draws each subsequent center from a distribution that puts heavy probability weight where there are data points but no nearby cluster center. The K-Means++ and EM algorithms were executed using the WEKA tools, and the generated results were exported to Excel for easy comparative analysis; we evaluate the performance of the market basket based on the mean calculated across three years of customers' sales transactions. K-means++ comes with a theoretical guarantee of finding a solution that is O(log k)-competitive with the optimal k-means solution, and it is also fairly simple to describe and implement.
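The sketch below contrasts k-means++ seeding with EM clustering on synthetic RFM-style features. It uses scikit-learn rather than the WEKA tools mentioned above, so it is an illustration of the approach only, and the two synthetic customer groups are invented.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic customer groups in (recency, frequency, monetary) space.
features = np.vstack([
    rng.normal([30, 12, 800], [10, 3, 150], size=(100, 3)),
    rng.normal([200, 2, 90], [40, 1, 30], size=(100, 3)),
])
scaled = StandardScaler().fit_transform(features)

# K-Means with the k-means++ seeding procedure.
kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(scaled)

# EM clustering via a Gaussian mixture model.
em = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
em_labels = em.fit_predict(scaled)

agreement = np.mean(kmeans_labels == em_labels)
print(f"label agreement (up to label permutation): {max(agreement, 1 - agreement):.2f}")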
  • 31. & Rubin (1977). The Expectation-Maximization algorithm is an important tool of a statistical and pow- erful method for obtaining the maximum likelihood estimation of the parameters of an underlying distri- bution when data contains null and missing values to generate an accurate hypothesis. There are three steps involved in EM technique, and the first was the EM clustering initialization. Every class j, of M classes (clusters), is formed by a vector parameter (θ), composed by two parameters the mean (�j) and the covariance matrix (Pj) which defined the Gaus- sian probability normal distribution as features used to describe or classify the observed and unobserved entities of the data set x. The Expectation Maximiza- tion (EM) algorithm was aimed to approximate the parameter vector (θ) of the real distribution. Clus- ter Convergence was the third step in M-step. After every iteration was performed, a convergence inspec- tion was conducted to verify whether the difference of the attributes vector of an iteration to the previous iter - ation was smaller than an acceptable error tolerance given by parameter. 3.3. Quality assessment One type of cluster quality assessment as suggested in [41] was performed in this study is to compare between the size (diameter) of the clusters versus the distance to the nearest cluster, the inter-cluster dis- tance versus the size of the cluster is conducted in this study. This process can also be understood in terms of the distance between members of each cluster and F. Yoseph et al. / The impact of big data market segmentation
  • 32. using data mining and clustering techniques 6167 Fig. 2. Cluster quality accessed by ratio of distance to the nearest cluster and cluster diameter: source (Draghici, 2003). the center of the cluster, and the diameter (size) of the smallest sphere containing the cluster. If the inter- cluster distance was much larger than the size of the clusters, then the clustering algorithm is considered to be trustworthy. The formula is explained as; min (Dij)/max (di) where Dij was the distance between cluster I and cluster j, where both i and j were from (1 to 6) and di is the diameter. Figure 2 shows the quality can be assessed sim- ply by looking at the cluster diameter. Therefore, the cluster can be formed by heuristic even when there is no similarity between clustered patterns. This is occurring because the algorithm forces K clusters to be created. 4. Results and discussion on RFM and RFMPQ dataset The main objective driven by the customer pyramid classification is to determine the segment in which customers spend more with the retailer over a period of time and also the segment that is less costly to maintain. The distribution of customers after using CLTV matrix on the RFMPQ dataset is illustrated in Fig. 3 with respect to their frequency and average monetary values. The findings generated by customer value matrix classification revealed that there were 7024 customers from who were categorized under the Best segment, 25153 customers fell under the
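A minimal sketch of this quality ratio is given below, taking the cluster diameter as the maximum pairwise distance within a cluster and the inter-cluster distance as the distance between centroids; the toy points and the SciPy dependency are assumptions of the illustration.

import numpy as np
from itertools import combinations
from scipy.spatial.distance import pdist

def quality_ratio(points, labels):
    """min(Dij) / max(di): smallest centroid distance over largest diameter."""
    clusters = [points[labels == k] for k in np.unique(labels)]
    centroids = np.array([c.mean(axis=0) for c in clusters])
    min_between = min(np.linalg.norm(a - b) for a, b in combinations(centroids, 2))
    max_diameter = max(pdist(c).max() for c in clusters)
    return min_between / max_diameter

points = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
print(round(quality_ratio(points, labels), 2))   # large ratio: well separated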
4. Results and discussion on the RFM and RFMPQ datasets

The main objective of the customer pyramid classification is to determine the segment in which customers spend more with the retailer over a period of time, and also the segment that is less costly to maintain. The distribution of customers after applying the CLTV matrix to the RFMPQ dataset is illustrated in Fig. 3 with respect to their frequency and average monetary values. The customer value matrix classification revealed 7024 customers in the Best segment, 25153 in the Spender segment, 39107 in the Frequent segment, and 39624 in the Uncertain segment; these findings indicate that most customers were shoppers with either a limited or a high budget.

Fig. 3. Effective customer pyramid classification.

Based on the results in Fig. 4, the K-means++ and EM clustering algorithms agree on the gender and age segments. Cluster 1 is the most beneficial segment of customers; the best spenders in cluster 1 are predominantly females in the age groups (25-34) and (35-44). The cluster quality assessment of the RFM dataset, shown in Table 3, indicates that the K-means++ algorithm, with a quality value of 328.4529657 compared with 271.6329114 for the EM algorithm, is considerably more accurate, because it yields the larger value of inter-cluster distance divided by cluster size.

Fig. 4. The averages of the four clusters generated by the K-means++ and EM algorithms for gender and age.

Since classification according to the CLTV matrix was applied prior to clustering, the discussion of the results is divided into subsections according to the quadrants of the CLTV matrix. The first part elaborates on the demographic clustering for gender and age, with summaries of both provided for each segment; the following part discusses each cluster's average shopping (visit) frequency, average monetary value, and average spending per visit.

4.1. RFMPQ results using K-means++ and EM on the Best, Spender, Frequent, and Uncertain segments

The results in Fig. 5 summarize all four segments, namely Best, Spender, Frequent, and Uncertain; in each, most customers were female, with the average age varying by cluster. In general, K-means++ was the better clustering method, except in the Best segment, where the accuracies were very close owing to the smaller number of customers. Each cluster is discussed in further detail with respect to nationality in the following sections.

Fig. 5. Summary for all segments in demographic clustering (gender).

Figure 6 shows the four clusters generated by K-means++ and EM for the Uncertain segment, which contains 39624 customers in total. Both clustering algorithms showed that cluster 1 was the most beneficial cluster (segment) because it was superior to the other clusters in terms of the demographic variables. The average age results showed that customers between (45-54) were the most beneficial, and the average gender indicated that most customers were male. In terms of average nationality, citizens of the Philippines were the best customers in average monetary value under K-Means++, and citizens of India were the best customers under EM clustering.

Fig. 6. Four clusters generated by K-Means++ and EM clustering for the Uncertain segment.

4.2. Summary of the market segmentation on the RFMPQ dataset

With regard to the market segmentation on the RFMPQ dataset, the Best segment was the most profitable, with a total average spending per segment of 1109.37. Cluster 2 was the most profitable cluster in the Best segment, with mostly female customers in the age groups (25-34) and (35-44); the average nationality results showed that citizens of Kuwait and the UAE were the best customers in average monetary value, and K-means++ and EM showed similar accuracy here. The Spender segment was the second most profitable segment, with a total average spending per segment of 834.01; cluster 4 was its most profitable cluster, with an average monetary value of 924.19.
  • 36. The Uncertain segment was the least profitable seg- ment with its Total Average Spending Per Segment (168.59). Cluster 1 was the least profitable clus- ter with an average monetary of (239.14). Female customers, those in the age group of (45–44) and cit- izens of the Philippines and some Arab nationalities were the best customers, and K-means++ algorithm was the most accurate algorithm for this cluster in Uncertain segment. Based on the comprehensive information extracted from the RFMPQ dataset, it is concluded that (a) the newly proposed PQ method can fulfill both robust classification and robust seg- mentation for market segmentation model, even in the dataset that has noisy data. Also, it is noted that (b) The (PQ) is the most important variable, and it has also been proven as a crucial parameter for clustering. 4.3. Predictive customer lifetime value Predictive CLTV Matrix projects the new cus- tomers’ expenses over their entire lifetime with the retailer. Most retailers measure CLTV in dollars or based on the retailer’s local currency spent by the cus- tomer over their entire relationship with the retailer, i.e., from the first to last transaction. Nevertheless, it F. Yoseph et al. / The impact of big data market segmentation using data mining and clustering techniques 6169 Fig. 5. Summary for all segments in demographic clustering (Gender). Fig. 6. Four clusters generated by K-Means++ and EM
  • 37. clustering for uncertain segment. is essential to consider COGS (Cost of Goods Sold) customers’ expected lifetime and potential monetary and acquisition cost to square off the real CLTV. value from new purchases. Furthermore, based on the Predictive Customer Lifetime Value (CLTV): Estima- clients’ request, the static discount rate is set 25% in, tion of the customer’s future value also considers the of which 25% is minced from the total sales. The 6170 F. Yoseph et al. / The impact of big data market segmentation using data mining and clustering techniques Fig. 7. The result from the strategies implemented using ranking matrix. acquisition cost, which is also set based on clients’ request, which is 400 becomes the estimated figure or percentage of how much is spent to attract new customers. 4.4. Predictive customer lifetime value Predictive CLTV Matrix projects the new cus- tomers’ expenses over their entire lifetime with the retailer. Most retailers measure CLTV in dollars or based on the retailer’s local currency spent by the cus- tomer over their entire relationship with the retailer, i.e., from the first to last transaction. Nevertheless, it is essential to consider COGS (Cost of Goods Sold) and acquisition cost to square off the real CLTV. Predictive Customer Lifetime Value (CLTV): Estima- tion of the customer’s future value also considers the
  • 38. customers’ expected lifetime and potential monetary value from new purchases. Furthermore, based on the clients’ request, the static discount rate is set 25% in, of which 25% is minced from the total sales. The acquisition cost, which is also set based on clients’ request, which is 400 becomes the estimated figure or percentage of how much is spent to attract new customers. 4.5. Customer lifetime value ranking matrix CLTV ranking groups customers into quad- rants according to their profitability and retention propensity. CLTV ranking can assist marketers in performing more effective market segmentation [17]. This is accomplished through the incorporation of RFMPQ and RFMT values with the CLTV to rank the total sales quarterly and yearly using the ranking matrix analytical function, as shown in Fig. 7. Experimental results on CLTV using Ranking Matrix, which is presented in Fig. 7 reflects the success of the implementation of the market segmen- tation data mining model. Important to note, the test POS dataset is from 2012 to 2015. The client noticed a drop in sales and a decline in foot traffic in the sec- ond half of 2011. However, the development of our Market Segmentation model started in 2013, and our market segmentation model was implemented in the fourth quarter of 2014. The analysis indicated that the client had lost customers and sales immediately dropped in quarter 1 of 2015 as a direct result from the shift of mass marketing strategy to target mar- keting strategy using market segmentation methods. This is considered as a healthy result due to the shift-
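The sketch below puts these settings together: the 25% discount rate and the acquisition cost of 400 are taken from the text, while the gross margin (COGS rate), retention rate, and sales figures are invented for illustration only.

def predictive_cltv(periodic_sales, cogs_rate, retention_rate,
                    discount_rate=0.25, acquisition_cost=400.0):
    """Discounted, retention-weighted profit net of COGS, less acquisition cost."""
    value = 0.0
    for t, sales in enumerate(periodic_sales, start=1):
        profit = sales * (1 - cogs_rate)
        value += profit * (retention_rate ** t) / ((1 + discount_rate) ** t)
    return value - acquisition_cost

# A new customer expected to spend 1200 and 900 over two periods,
# with an assumed 60% gross margin and 48% retention (i.e., 52% churn).
print(round(predictive_cltv([1200.0, 900.0], cogs_rate=0.4, retention_rate=0.48), 2))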
4.4. Customer lifetime value ranking matrix

CLTV ranking groups customers into quadrants according to their profitability and retention propensity, and can assist marketers in performing more effective market segmentation [17]. This is accomplished by incorporating the RFMPQ and RFMT values with CLTV to rank total sales quarterly and yearly using the ranking matrix analytical function, as shown in Fig. 7.

Fig. 7. The result of the strategies implemented, using the ranking matrix.

The experimental results on CLTV using the ranking matrix, presented in Fig. 7, reflect the success of the implemented market segmentation data mining model. It is important to note that the test POS dataset covers 2012 to 2015. The client had noticed a drop in sales and a decline in foot traffic in the second half of 2011; development of our market segmentation model started in 2013, and the model was implemented in the fourth quarter of 2014. The analysis indicated that the client lost customers and sales dropped immediately in quarter 1 of 2015 as a direct result of the shift from a mass marketing strategy to a target marketing strategy using market segmentation methods. This is considered a healthy outcome of the shift from mass marketing to more customized market segmentation. Nonetheless, from quarter 2 onwards the client gained a small increase in sales and retained a new group of customers. The results also revealed that the retailer began to turn profitable from quarter 3, with strong growth in quarter 4 of 2015, and consistent results were observed in each quarter of 2015. The client thus strengthened its market position and slowed the effects of a rapidly weakening overall retail market. The analysis generally shows that the average customer lifetime is only two years and the churn rate is 52% (see Table 6), which led to the development of a comprehensive marketing strategy.

The results have shown that the incorporation of the RFMPQ model and the CLTV matrix was the best method to categorize and classify different consumer segments, and different potential consumer segments, by long-term profitability. Each market segment will be further investigated and analyzed to provide specific characterizations, a better understanding and identification of opportunities, and an assessment of profitability and risk, since each segment no longer encompasses the entire market. Based on the segments identified, the retailer should create a tailored campaign for each segment and offer separate services that distinguish these segments, maximizing responses according to each segment's behavior.
  • 40. each segment’s behavior. Finally, based on the exper- iment results, it can be positively concluded that the newly proposed RFM (PQ) and (T) were far more efficient than the traditional RFM in classifying and clustering the segmentation of customer value via RFM attributes. Kindly note that the implementation of this work was done in 2015; however, for some circumstances, our documentation and publication process was delayed for quite some time. 5. Conclusions This study has introduced a new technique of CLTV and proposed three new RFM variations P, Q, and T. Firstly, the new RFM variations P, Q, and T, which have some advantages over the standard RFM model. The advantage of these newly pro- posed methods is that the RFMPQ method takes into consideration the changes in the customer’s average shopping power over time as well as the average pur- chase power of products purchased over time by a given customer. “The new proposed PQ provided key attributes describing customer’s purchase behavior in different demographic eras. The P variant helped to identify which segment of customers spends more over a period of time and costs less to maintain. The Q variant helped to identify customer’s attractiveness to a specific product and service, and also helped to identify best-selling products, which resulted in increase of sales potential in terms of number of units of best-selling products that can be sold.” The market segments were constructed through several clustering algorithms such as K-Means++ and EM. To convert this idea into a computable dynamic parameter, newly modified best-fit regression algorithm and RFMT variable were proposed. The new modified regression
  • 41. algorithm demonstrated a high degree of accuracy in strengthening and reinforcing the effect of the customer’s most recent T purchase while justify- ing the importance of previous consumer’s purchase. However, more modified regression algorithm can be further extended to highlight key customer’s trends using new demographic variables, like profession, income level, spending methods, and online spend- ing. Results of the application of the new RFM PQ T variations have reflected their effectiveness for mar - ket segmentation and their ability to offer the retail industry with intelligent analysis to combine cus- tomer and product profitability. This type of analysis can also identify the area in which marketers should focus their attention to as well as ensuring the highest level of quality and customer service. Additionally, these models help retailers identify VIP consumers who are on the verge of shifting taken their business to another competitor. Products that are highly prof- itable and purchased by the most profitable segments can also be easily identified. Results of the analysis results as well as the marketing experts have agreed that the classification of customer purchase behavior using CLTV matrix against RFMPQ dataset revealed the most accurate and crucial information about cus- tomer purchase behavior. Furthermore, results have also indicated that age and gender variables provided an accurate result with an estimated analysis accuracy of 75%. Nevertheless, the nationality variable presented a low percentage of accuracy in both algorithms, possibly due to the missing and noisy data related to these variables. The results of applying the new RFM PQ and T variations have portrayed their effectiveness which resulted in
  • 42. the sales growth rate of up to 6% for market segmen- tation and their ability to offer the SMR industry with intelligent analysis to combine customer and product profitability. Secondly, the retail is data-driven industry and pro- cess large POS transactions Increase in the collection of data is often seen as bottlenecks for Big Data anal- ysis. Many retailers face the challenge of keeping data on a platform, which gives them a single con- sistent view. Although Hadoop is not the main focus of this study, however, in this study, the capabilities of Hadoop was investigated. Hadoop implementa- tion provided a highly scalable data distribution and lighting data analysis performance. Finally, it has been proven that sophisticated statistical modeling methods can provide useful information for experts, but at the same time they are costly and complex to implement and are likely to present a challenge to the implementation of marketing strategies. This study proposed a sim- ple yet powerful approach to market segmentation. The results have shown the model provides easy to implement and affordable market segmentation methodologies that deliver substantial value rela- tive to the amount of effort involved. The analysis results discovered from the (Best, Spender, Fre- quent, and Uncertain) provided the client a guided vision on how to calculate customer lifetime value and retention. Further search recommendation is to examine the implementation of large POS dataset on multiple clusters and scaling the cluster by adding extra nodes
Finally, ignoring trends in consumer purchase behavior can be disastrous. Further research is needed on more explanatory data mining models, such as Market Basket Analysis, to help identify what the profile of the future customer might look like from a product perspective.

Funding

This research received no external funding.

Acknowledgments

We are grateful for the support given by Universiti Sains Malaysia, where Y.A.F. completed his Masters on this research under the supervision of N.H.A.H.M.

References

[1] A.B. Patel, M. Birla and U. Nair, Addressing Big Data Problem Using Hadoop and Map Reduce, in 2012 Nirma University International Conference on Engineering (NUiCONE) (2012), pp. 1-5.
[2] A. Chorianopoulos and K. Tsiptsis, Data Mining Techniques in CRM: Inside Customer Segmentation, no. 21, John Wiley & Sons, 2010.
[3] A. David and S. Vassilvitskii, k-means++: The advantages
  • 44. of careful seeding, Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2007. [4] A.P. Dempster, N.M. Laird and D.B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, J of the R Stat Soc Ser B 39(1) (1977), 1–38. [5] A. Kak, Expectation-Maximization Algorithm for Cluster- ing Multidimensional Numerical Data, Purdue University, 2017. [6] A. Neha, A. Kirti and G. Kanika, Comparative Analysis of k-means and Enhanced K-means clustering algorithm for data mining, Int J Sci Eng Res 3(3), 2012. [7] B. Sohrabi and A. Khanlari, Customer Lifetime Value (CLV) Measurement Based on RFM Model, Iran Account Audit Rev 14(47) (2007), 7–20. [8] B. Tammo HA, et al., Analytics for customer engagement, Journal of Service Research 13(3) (2010), 341–356. [9] C.C. Aggarwal, Data Mining: The textbook. New York, USA: Springer, 2015. [10] C. Bauckhage, Lecture Notes on Data Science : k-Means Clustering Is Gaussian Mixture Modeling, B-IT, University of Bonn We, 2015. [11] C. Marcus, A Practical Yet Meaningful Approach to Customer Segmentation, J Consum Mark 15(5) (1998), 494–504. [12] D. Arthur and S. Vassilvitskii, k-means ++ : The Advan- tages of Careful Seeding, in SODA ’07 Proceedings of the
  • 45. eighteenth annual ACM-SIAM symposium on Discrete algo- rithms, (2007), pp. 1027–1035. [13] D. Birant, Data Mining Using RFM Analysis Knowledge- Oriented Applications in Data Mining, Prof Kimi InTech, 2011. [14] D.R. Kishor and N.B. Venkateswarlu, Hybridization of Expectation-Maximization and K-Means Algorithms for Better Clustering Performance, Cybern Inf Technol 16(2) (2016), 16–34. [15] F. Jiashuang, Y. Suihuai, Y. Mingjiu, et al., Optimal selection of design scheme in cloud environment: A novel hybrid approach of multi-criteria decision-making based on F-ANP and F-QFD, DOI: 10.3233/JIFS-190630, Journal: Journal of Intelligent & Fuzzy Systems, vol. Pre-press, no. Pre-press, (2019), pp. 1-18. [16] F. Peter, S. Bruce, G.S. Hardie and K. Lok Lee, Count- ing your customers the easy way: An alternative to the Pareto/NBD model, Marketing Science 24(2) (2005), 275–284. [17] G. Lefait and K. Tahar, Customer Segmentation Archi- tecture Based on Clustering Techniques, in 2010 Fourth International Conference on Digital Society, (2010), pp. 243–248. [18] G. Sunil, C.F. Mela and J.M. Vidal-Sanz, The Value of A ‘Free’ Customer, 2006. [19] I. He and C. Li, The Research and Application of Customer Segmentation on E-commerce Websites, in 2016 6th Inter-
  • 46. national Conference on Digital Home (ICDH), (2016), pp. 203–208. [20] I. Kolyshkina, E. Nankani, S.J. Simoff and S.M. Denize, RDMA Aware Networks Programming User Manual, in Australian & New Zealand Marketing Academy. Confer- ence (ANZMAC), (2010), pp. 1–7. [21] I. Yeh, K. Yang and T. Ting, Knowledge Discovery on RFM Model using Bernoulli Sequence, Expert Syst Appl 36(3) (2009), 5866–5871. [22] J.A. McCarty and M. Hastak, Segmentation Approaches in Data-Mining: A Comparison of RFM, CHAID, and Logistic Regression, J Bus Res 60 (2007), pp. 656–662. [23] J.L. Schafer, Analysis of Incomplete Multivariate Data, First edit. USA: CRC Press LLC, 1997. [24] J. Maryani and D. Riana, Clustering and Profiling of Customers Using RFM For Customer Relationship Manage- ment Recommendations, in 5th International Conference on Cyber and IT Service Management (CITSM), (2017), pp. 1–7. [25] J.R. Miglautsch, Thoughts on RFM scoring, J Database Mark Cust Strateg Manag 8(1) (2000), 67–72. [26] J. Wei, S. Lin and H. Wu, A review of the application of RFM model, African J Bus Manag 4(19) (2010), 4199–4206. [27] K. Dipanjan, S. Garla and C. Goutam, Comparison of Probabilistic-D and k-Means Clustering in Segment Pro-
files for B2B Markets, in SAS Global Forum, Las Vegas, NV, USA, 2011. [28] K.R. Kashwan and C.M. Velu, Customer Segmentation Using Clustering and Data Mining Techniques, Int J Comput Theory Eng 5(6), 2013. [29] M. Cleveland, N. Papadopoulos and M. Laroche, Identity, Demographics and Consumer Behaviors: International Market Segmentation Across Product Categories, Int Mark Rev 28(3) (2011), 244-266. [30] M. Khajvand, Z. Kiyana, A. Sarah and A. Somayeh, Estimating Customer Lifetime Value Based on RFM Analysis of
  • 48. ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, (2012), pp. 236–241. [34] O. Geman, et al., book chapter: From Fuzzy Expert Sys- tem to Artificial Neural Network: Application to Assisted Speech Therapy, INTECH, Artificial Neural Networks. Models and Applications, Edited by Joao Luis Garcia Rosa, Published: October 19th 2016, DOI: 10.5772/61493, ISBN: 978-953-51-2705, 2016. [35] O. Geman, et al., book chapter Mathematical Models Used in Intelligent Assistive Technologies: Response Surface Methodology in Software Tools Optimization for Medical Rehabilitation, Intel Syst Ref Library, Vol. 170, Hari- ton Costin et al. (Eds): Recent Advances in Intelligent Assistive Technologies: Paradigms and Applications, 978- 3-030-30816-2, 463829 1 En, (4), 2019 [36] P.K. Srimani and M.S. Koti, Evaluation of Principal Com- ponents Analysis (PCA) and Data Clustering Technique (DCT) on Medical Data, Int J Knowl Eng 3(2) (2012), 202–206. [37] P. Mishra and S. Dash, Developing RFM Model for Cus- tomer Segmentation in Retail Industry, Int J Mark Hum Resour Manag 1(1) (2010), 58–69. [38] R.A.I.T. Daoud, B. Belaid, A. Abdellah and L. Rachid, Combining RFM Model and Clustering Techniques for Customer Value Analysis of a Company selling online, in 2015 IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA), 2015, pp. 1–6. [39] R. Soudagar, Customer Segmentation and Strategy Defini-
  • 49. tion in Segments Case Study: An Internet Service Provider in Iran, Lulea University of technology Department, 2012. [40] R. Sethy and M. Panda, Big Data Analysis using Hadoop : A Survey International Journal of Advanced Research in Big Data Analysis using Hadoop : A Survey, Int J Adv Res Comput Sci Softw Eng 5(7), 2015. [41] S. Chen, Cheetah : A High Performance, Custom Data Ware- house on Top of MapReduce, in Proceedings of the VLDB Endowmen, (2010), pp. 1459–1468. [42] S. Draghici and A. Kuklin, Data Analysis Tools for DNA Microarrays, 1st editio., no. June 2003. CRC Press, 2003. [43] S. Mohammad, S. Hosseini, A. Maleki and M.R. Gho- lamian, Cluster Analysis using Data Mining Approach to Develop CRM Methodology to Assess the Customer Loy- alty, Expert Syst Appl 37(2010) (2010), 5259–5264. [44] S. Zahra, M.A. Ghazanfar, A. Khalid, M.A. Azam, U. Naeem and A. Prugel-bennett, Novel Centroid Selection Approaches for KMeans-Clustering Based Recommender Systems, Inf Sci (Ny) 320 (2015), 156–189. [45] T.K. Hun and R. Yazdanifard, The Impact of Proper Market- ing Communication Channels on Consumer’s Behavior and Segmentation Consumers, Asian J Bus Manag 2(2) (2014), 155–159. [46] T. White, Hadoop: The Definitive Guide, Third Edit. Sebastopol, United States: O’Reilly Media, Inc, USA, 2012. [47] T. Upadhyay, V. Atma and D. Vishal, Customer Profiling
and Segmentation using Data Mining Techniques, Int J Comput Sci Commun 7 (2016), 65-67. [48] V. Vandenbulcke, F. Lecron, C. Ducarroz and F. Fouss, Customer Segmentation Based on a Collaborative Recommendation System: Application to a Mass Retail Company, in Proceedings of the 42nd Annual Conference of the European Marketing Academy, 2013, pp. 1-7. [49] V.S. Jose, M. Carl, F. Mela and S. Gupta, The value of a free customer, No. wb092903, Universidad Carlos III de Madrid, Departamento de Economía de la Empresa, 2009. [50] W. Dong, Y. Cai and T. Kwok Leung, Making EEMD more effective in extracting bearing fault features for intelligent bearing fault diagnosis by using blind fault component separation, DOI: 10.3233/JIFS-169523, Journal of Intelligent & Fuzzy Systems 34(6) (2018), 3429-3441. [51] W. Zhao, H. Ma and Q. He, Parallel K-Means Clustering Based on MapReduce, in IEEE International Conference on Cloud Computing, (2009), pp. 674-679.