Assignment 2: Cluster Analysis and Predictive Modelling

BUS5PA - 19139507
SIDDHANTH CHAURASIYA 19139507

PART A - SEGMENTATION BASED EXPLORATION OF CUSTOMERS
---------------------------------------------------------------------------------------------------------------------------
This section of the report contains the exploration and findings from the segmentation and
clustering analysis conducted on the CHURN_TELECOM dataset, using SAS Miner. Three types of
segmentation were carried out on the basis of distinct determinants: Demographics, Customer
Status and Customer Usage.
Demographics based Profiling:
After creating the project, library and diagram, we add the data source and set the roles of all the
variables as Input, except for Churn Flag (Target), Customer and Subscription identifier (ID) and
Subscriber name (Text).
We drag the data source into the diagram and connect it with the Cluster and Segment Profile nodes.
Age, Gender and Customer Value are selected as the variables for the Cluster as well as the Segment
Profile node for this profiling activity. Customer Value contains 68% missing values, and thus
imputing those missing values with a synthetic value (mean, median, max, etc.) would create a very
skewed distribution, which isn't desirable. Hence, Customer Value isn't imputed.
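The concern about imputation can be illustrated with a small Python sketch. The numbers are made up for illustration and are not taken from CHURN_TELECOM; the sketch only assumes a column in which about two thirds of the values are missing, and shows how mean imputation piles those rows onto a single value and shrinks the spread.

```python
import statistics

# Hypothetical Customer Value column; None marks a missing value.
# 4 observed values out of 12 rows, so about 67% are missing (illustrative only).
customer_value = [120.0, None, None, 300.0, None, None,
                  None, 80.0, None, None, 500.0, None]

observed = [v for v in customer_value if v is not None]
missing_share = 1 - len(observed) / len(customer_value)

# Mean imputation: every missing row receives the same synthetic value.
mean_value = statistics.mean(observed)
imputed = [mean_value if v is None else v for v in customer_value]

spread_before = statistics.pstdev(observed)
spread_after = statistics.pstdev(imputed)
share_at_mean = imputed.count(mean_value) / len(imputed)

print(f"missing share: {missing_share:.0%}")
print(f"std dev of observed values: {spread_before:.1f}")
print(f"std dev after mean imputation: {spread_after:.1f}")
print(f"share of rows sitting exactly at the mean: {share_at_mean:.0%}")
```

With most rows forced onto one value, any distance-based clustering would largely be separating imputed rows from observed ones.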
Figure 1: Process flow for Demographics based Profiling.
Since the measurement scales of the variables selected as the input for Demographical profiling are
different, we keep the method for 'Internal Standardization' as 'Standardization' in the properties
panel of the Cluster node. All other properties of the Cluster and Segment Profile nodes are kept at
default.
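A minimal Python sketch of why standardization matters for distance-based clustering is given here. It is not the Cluster node's internal code: the z-score formula with a population standard deviation is an assumption, and the four customers are hypothetical.

```python
import math
import statistics

def standardize(column):
    """Rescale a numeric column to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population std dev; an assumption
    return [(x - mean) / std for x in column]

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical customers: age in years and customer value in currency units.
ages = [22.0, 25.0, 60.0, 63.0]
values = [900.0, 100.0, 880.0, 120.0]

raw = list(zip(ages, values))
scaled = list(zip(standardize(ages), standardize(values)))

# Customer 0 vs 1: similar age, very different value.
# Customer 0 vs 2: very different age, similar value.
raw_ratio = euclidean(raw[0], raw[1]) / euclidean(raw[0], raw[2])
scaled_ratio = euclidean(scaled[0], scaled[1]) / euclidean(scaled[0], scaled[2])

print(f"raw distance ratio: {raw_ratio:.1f}")
print(f"standardized distance ratio: {scaled_ratio:.2f}")
```

On the raw scale the value gap dominates the distance; after standardization an age gap and a value gap of similar relative size count about equally.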
Figure 2: Cluster and Segment Profile results for Demographical segmentation.
We found a good combination of clusters with a fair number of observations in each segment (Figure
2) after setting the number of clusters to 4. The four segments could be broadly classified as:

Cluster 1 – Valuable Young Adults.
This segment can be described as a group of males who are just about to start their professional
careers and generate high customer value for the organization. Since this cluster shows a tendency
towards high customer value, the company should ensure retention of this segment.
Cluster 2 – Distressed Damsel.
This cluster can best be described as a segment of young females who account for a relatively
lower Customer Value. This lower customer value may be an indicator that these customers aren't
satisfied with the services offered and may churn in the future. The company should devise plans,
offers and discounts to reduce the chances of churn in this cluster.
Figure 3: Results from Segment Profile node.
Cluster 3 – Stingy Seniors.
This group is characterized by senior males who generate low value for the Telecom company. As
such, customers belonging to this segment may need special attention, as they have a high likelihood
of churning, as indicated by their low customer value generation.
Cluster 4 – Bankable Ladies.
This cluster is characterized by older women who produce high value for the company. The company
should look to maximize the value derived from this segment.

Figure 4: Variable significance for each cluster.
As observed from Figure 4, Gender was the most influential variable for the classification of
Distressed Damsel, Stingy Seniors and Bankable Ladies, while Age has the most significance for
Valuable Young Adults.
Note: The variable Customer Value was only collected for customers who were identified as having a
high probability of churning. Customer Value wasn't collected for customers who had a low probability
of churning. As such, this leads to a distorted cluster analysis. However, since we don't have
sufficient demographic variables, we still use Customer Value for the clustering.
Customer Status based Profiling:
To conduct Customer Status based segmentation, we opt for variables which highlight the status of
the customer with reference to the services offered by the company. Email queries sent, revenue
through GPRS, internet and fix-line, and days since last complaint are the variables selected.
Through StatExplore we found that the distributions of the latter four selected variables were
highly skewed, and thus we normalize them using the Transform Variables node.
Figure 5: Process flow for Customer Status based Profiling.
Setting up 4 clusters led to an excellent split into fairly equal segments. The four clusters could
be interpreted as:
Cluster 1 – Super Active
This cluster is characterized by customers who tend to correspond back and forth with the company
through emails quite often but haven't really had a complaint regarding the services recently.
Additionally, these customers generate relatively higher revenue through internet, GPRS as well as
fix-line services. As such, the customers from this segment are very important from a profitability
point of view.
Cluster 2 – Curious
Customers from this cluster can be described as being quite curious about the new plans, as evident
from their high number of email queries sent in the past 6 months. Similarly, they have lodged a
complaint very recently and produce high revenue through the internet medium for the company.
Thus, they have been aptly named 'Curious'. This cluster will need special attention from the
organization, as it shows signs of churning.
Cluster 3 – Content
Customers belonging to this segment are rather satisfied with the services and have a laid-back
attitude. These customers don't generally send in email queries and haven't made a complaint with
the company recently. The cash inflow generated by these particular customers closely matches the
overall distribution of the customers across the whole dataset.
Cluster 4 – Transitionals
'Transitionals' represents a cluster of customers who are transitioning to the modern services
offered by the company. They have made a complaint fairly recently but don't generally send many
emails to the organization. The revenue generated through internet by them is on the lower side,
but they produce high revenue through fix-lines and GPRS.
Days since last complaint was a very significant variable for the clusters 'Super Active' and
'Transitionals', while email queries were strong determinants for the clusters 'Curious' and
'Content' (Appendix - Figure 11).

Usage based Profiling:
To conduct usage based profiling, we select variables which highlight usage patterns: outgoing
national, international, roaming and local calls, change in bill, and revenue through internet and
fix-line. Since these variables were highly skewed, we used the Transform Variables node to
normalize their distribution.
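The effect of such a transformation can be sketched in Python. This is an illustration under stated assumptions, not the Transform Variables node itself: the revenue figures are hypothetical, and log(1 + x) is chosen here only so that zero values stay defined.

```python
import math
import statistics

def skewness(xs):
    """Population skewness: the third standardized moment."""
    mean = statistics.mean(xs)
    std = statistics.pstdev(xs)
    return sum(((x - mean) / std) ** 3 for x in xs) / len(xs)

# Hypothetical monthly internet revenue with a long right tail.
revenue = [0.0, 2.0, 3.0, 5.0, 8.0, 12.0, 20.0, 45.0, 150.0, 900.0]

# log1p(x) = log(1 + x); an assumed choice of log transform.
log_revenue = [math.log1p(x) for x in revenue]

skew_raw = skewness(revenue)
skew_log = skewness(log_revenue)

print(f"skewness before transform: {skew_raw:.2f}")
print(f"skewness after log transform: {skew_log:.2f}")
```

A skewness close to zero indicates a roughly symmetric distribution, so a few very large values no longer dominate the clustering distances.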
Figure 7: Process flow for Usage based Profiling.
Since we converted all the variables to log scale, we set 'Standardization' to None. We set the
number of clusters to 4. The results were interpreted as:
Cluster 1 – Cosmopolitan
This cluster is characterized by customers who have a high usage of outgoing international calls.
The rest of their usage, such as national calls, local calls, roaming calls and internet, is the
same as the pattern of customers across the dataset. As such, the company should offer customers
from this cluster plans which are more attractive for international calling, if it wants to retain
them in the long run.
Cluster 2 – Connected
Customers from this cluster tend to have a high usage of outgoing calls at the national level. Their
usage of other services is similar to the overall usage pattern of the customers. Churning
customers from this segment can be lured back by offering them value-for-money plans for national
calling.

Cluster 3 – Traditionals
This cluster is described as 'Traditionals' since their usage pattern stays the same throughout, as
evident from their low percentage change in bills. Their utilization of national and international
calls stays on the lower side, though they use a high amount of internet.
Cluster 4 – Modern
This segment is described as 'Modern' as it is characterized by the usage of contemporary customers:
fluctuating bills, low usage of calls (national, local, international and roaming) and high internet
usage.

Figure 9: Variables' influence on each cluster.
Cross-cluster analysis:
After creating the respective clusters based on Demographics, Customer Status and Usage, we conduct
a cross-cluster analysis to explore whether there is any association between these segments which
could potentially be harnessed into something profitable for the company.
We add the Save Data node to the Cluster nodes and export the segment data from all three
categories. Using the VLOOKUP function in Excel, we arrived at the following observations:
                 Demographic
Usage            Valuable Young Adults   Distressed Damsel   Stingy Seniors   Bankable Ladies
Cosmopolitan     28.79%                  23.51%              21.85%           22.92%
Connected        22.06%                  29.47%              23.69%           25.45%
Traditionals     25.63%                  22.29%              27.66%           35.29%
Modern           24.59%                  25.78%              25.23%           21.02%
Cross-cluster analysis: Demographic vs Usage
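The lookup-and-count step behind a table of this kind can be sketched in Python as an alternative to VLOOKUP. The customer IDs and segment assignments are hypothetical, and normalising the counts within each demographic segment is an assumption, since the report does not state how its percentages were computed.

```python
from collections import Counter

# Hypothetical exports from two Cluster nodes: customer ID -> segment label.
demographic_segment = {
    "C001": "Valuable Young Adults",
    "C002": "Valuable Young Adults",
    "C003": "Valuable Young Adults",
    "C004": "Bankable Ladies",
    "C005": "Bankable Ladies",
    "C006": "Distressed Damsel",
}
usage_segment = {
    "C001": "Cosmopolitan",
    "C002": "Cosmopolitan",
    "C003": "Modern",
    "C004": "Traditionals",
    "C005": "Traditionals",
    "C006": "Connected",
    "C007": "Modern",  # present in only one export, so it is ignored below
}

def cross_cluster_shares(column_segments, row_segments):
    """Return {(row_label, column_label): share of that column's customers}."""
    pair_counts = Counter()
    column_totals = Counter()
    for customer_id, column_label in column_segments.items():
        row_label = row_segments.get(customer_id)  # the lookup step
        if row_label is None:
            continue  # customer missing from the second export
        pair_counts[(row_label, column_label)] += 1
        column_totals[column_label] += 1
    return {pair: count / column_totals[pair[1]]
            for pair, count in pair_counts.items()}

shares = cross_cluster_shares(demographic_segment, usage_segment)
for (usage, demographic), share in sorted(shares.items()):
    print(f"{usage:<13} x {demographic:<22} {share:.2%}")
```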
It was seen that 'Valuable Young Adults' and 'Cosmopolitan' shared a good association, indicating
that young men tend to use international calling frequently. Similarly, it was observed that women
in later stages of life ('Bankable Ladies') had a very 'Traditional' usage, i.e. their bills rarely
fluctuated and they used the calling features about as much as the overall average. Lastly,
'Distressed Damsel' was closely related to 'Connected', which means they are quite active in terms
of outgoing national calls. These insights can be used very effectively to gain competitive
advantage and improve the offerings to the respective customers. A lot of business value can be
derived through correct interpretation of these insights and proper action on them.
Cross-clusters highlighted in red represent the groups which have a high chance of churning (derived
using the Churn Flag variable and VLOOKUP). As such, it is imperative that the company soon offers
such customers good discounts and plans, depending upon their usage, to retain them.
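One way to derive high-churn cross-clusters from the Churn Flag is sketched here in Python. The rows are hypothetical, and flagging a cell when its churn rate exceeds the overall churn rate is an assumed rule; the report does not state which threshold was used for the highlighting.

```python
from collections import defaultdict

# Hypothetical rows: (demographic segment, usage segment, churn flag 1/0).
customers = [
    ("Stingy Seniors", "Modern", 1),
    ("Stingy Seniors", "Modern", 1),
    ("Stingy Seniors", "Modern", 0),
    ("Bankable Ladies", "Traditionals", 0),
    ("Bankable Ladies", "Traditionals", 0),
    ("Valuable Young Adults", "Cosmopolitan", 0),
    ("Valuable Young Adults", "Cosmopolitan", 1),
    ("Valuable Young Adults", "Cosmopolitan", 0),
]

def churn_rate_by_cell(rows):
    """Return {(demographic, usage): share of customers with churn flag 1}."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for demographic, usage, flag in rows:
        totals[(demographic, usage)] += 1
        churned[(demographic, usage)] += flag
    return {cell: churned[cell] / totals[cell] for cell in totals}

rates = churn_rate_by_cell(customers)
overall_rate = sum(flag for _, _, flag in customers) / len(customers)

# Assumed rule: a cell is high risk when it churns more than the overall average.
high_risk = sorted(cell for cell, rate in rates.items() if rate > overall_rate)

print(f"overall churn rate: {overall_rate:.1%}")
for cell in high_risk:
    print(f"high risk: {cell[0]} x {cell[1]} ({rates[cell]:.1%})")
```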

Cross-cluster analysis between Customer Status and Usage helped us to discover some hidden
insights.
                 Customer Status
Usage            Super Active   Curious   Content   Transitionals
Cosmopolitan     22.37%         23.67%    26.39%    24.23%
Connected        33.42%         27.43%    22.13%    25.16%
Traditionals     23.92%         23.54%    30.03%    21.37%
Modern           25.41%         29.83%    24.22%    27.55%
Cross-cluster analysis: Customer Status vs Usage
'Super Active' customers tend to be involved in a lot of international interaction ('Cosmopolitan'),
while 'Curious' customers had a very modern-like usage. Similarly, customers who have been
categorized as 'Content' had a lot in common with 'Traditional' usage. As such, the company
should keep these insights in mind and prepare plans to maximize profit from such groups of
customers.
On the other hand, customers belonging to the 'Super Active' cluster with 'Traditional' usage have a
high probability of churning. Additionally, 'Transitionals' with high international calling usage
and national calling usage may leave the company sooner rather than later. Thus, the company needs
to hand out offers and discounts accordingly, based on the usage patterns mentioned above, to
retain those customers.
PART B – EXTENDING KNOWLEDGE OF PREDICTIVE ANALYTICS
---------------------------------------------------------------------------------------------------------------------------
Seven reasons for Predictive Analytics
Since the turn of the millennium, and especially in the last decade or so, there has been an
unprecedented generation of data, whether structured or unstructured. In fact, IBM has stated that
such large volumes of data are generated every day that the amount of data doubles every two years.
In this gigantic amount of data lie numerous hidden patterns and trends which, if harnessed in the
right manner, could result in enormous business value for the organization. Predictive analytics is
one such tool that can exploit this data to come up with meaningful and actionable insights.
Eric Siegel, an accomplished heavyweight in the field of Predictive Analytics, put forward his
thoughts on Predictive Analytics in a white paper, stating precisely why the world needs to embrace
Predictive Analytics. As per Eric Siegel, adoption, implementation and application of Predictive
Analytics can enable an organization to achieve the following seven objectives -
• Compete: Gain competitive advantage over rivals.
• Grow: Increase sales, expand the customer base and retain existing customers.
• Enforce: Detect fraud, anomalies and undesirable circumstances.
• Improve: Enhance and refine core product offerings, process automation and resource
optimization.
• Satisfy: Provide tailored solutions and recommendations for customers.
• Learn: Learn from past data (structured as well as unstructured) to provide insights
and foresight about the future.
• Act: Produce actionable recommendations and insights.
Case Study II – Predictive Analytics for Insurers
An insurance company's operating success chiefly relies on its forecasting capabilities. The primary
distinguisher between the best and the rest of the insurance companies is the accuracy with which
the organization can target potential customers, set the pricing of premiums and detect
fraudulent claims. Many of these tasks were carried out on the basis of guesstimates in the olden
days, a method which wasn't really efficient or cost-effective.
Soon, key determinants like age and history became the foundation on which insurance companies
forecasted their operations. However, today, Predictive Analytics has changed the entire landscape
of how insurance companies conduct their operations.
With the help of Predictive Analytics, insurance companies have been able to improve not only their
core operations (e.g. credit scores, fraud detection) but also the marketing of their products
(based upon buying patterns, i.e. hit ratio, retention ratio) and underwriting (filtering out
customers who do not meet given criteria, thereby saving time and money).
Relating Case Study II to seven reasons for Predictive Analytics
The application of Predictive Analytics is very prevalent in the insurance landscape, and is in fact
considered industry best practice. The business value that can be derived from the utilization of
Predictive Analytics in the field of insurance is tremendous. After thoroughly analysing the given
insurance case study, we could summarize how the usage of Predictive Analytics by insurance firms
enabled them to achieve the outcomes described by Eric Siegel as:
Compete:
The insurance industry is very competitive, with companies always iterating to stay one step ahead
of their rivals. Predictive Analytics can enable an organization to gather knowledge about its
customers in a more holistic manner, which can create a competitive advantage for the firm.
Similarly, Predictive Analytics can create credit score rating models, adverse selection models and
so on, which will help the organization stay ahead of its rivals.
Grow:
The insurance industry has witnessed snail-paced growth over the past few years. This has led to
organizations exploring options to expand to new horizons and locations. With the help of
Predictive Analytics, insurance companies can predict which customers are likely to respond to
offers and marketing campaigns. Similarly, through Predictive Analytics, they can understand buying
patterns, which can be used to improve marketing hit ratios and customer retention ratios.
Enforce:
One of the most significant functions of any insurance company is the detection of fraudulent
claims. With the help of scoring and ranking models, Predictive Analytics can highlight which claims
are suspicious and need more investigation before settlement.
Improve:
Predictive Analytics can greatly aid the operating efficiency and product offering of an insurance
company. Through predictive models, insurers can identify at the initial stage itself which claims
are likely to be settled for high value in the future. This allows the company to run its operations
more efficiently and economically. Additionally, Predictive Analytics can find out which customers
meet the stipulated obligations for the insurance and which customers do not. This helps in saving
the time, money and resources of the organization.
Satisfy:
To maximize customer value, insurers need to pitch the right type of insurance (life insurance,
vehicle insurance and so on) to the customer. By observing the buying patterns of customers,
predictive models can suggest the right fit of insurance individually for each customer. Similarly,
predictive models can assign a risk score to each customer depending upon various determinants
(age, location, history, etc.). These scores then enable the company to set appropriate premium
pricing for the customers accordingly.
Learn:
Predictive Analytics uses sophisticated models to find patterns and trends in the dataset. As
such, the usage of predictive models like linear regression, logit regression, decision trees and so
on can enable insurers to find whether any pattern exists between the variables. This information
can be used for various operational activities.
Act:
The insights and foresight generated by predictive models can add great business value if they
are implemented by the organization. Insurers have been proactively acting on the insights
produced by Predictive Analytics. Fraud detection, customer retention, churn analysis and adverse
selection are some of the models that have been created through predictive modelling and acted
upon by insurance companies.
Comment on seven reasons for Predictive Analytics and its relation with Churn Case Study
The seven reasons for Predictive Analytics stated by Eric Siegel add definite value to a Predictive
Analytics project. The reasons mentioned by 'Dr. Data' are comprehensive and describe the benefits
that could be derived from a predictive model at a very fine level.
From the above case study about insurance, we could observe and relate a real-life application of
the seven reasons for Predictive Analytics and how it proved advantageous to the industry.
The seven reasons for Predictive Analytics can also be witnessed, in part, in the Churn case study.
The churn analysis enables the Telecom company to gain competitive advantage ('Compete') over its
rivals, as it could act upon the high-churn customers and retain them ('Grow') by offering them
offers and discounts ('Act'), while competitors who don't use Predictive Analytics won't be able to
retain their high-churn customers.
The Decision Tree and Regression models were built using past data ('Learn'). The Decision Tree
model was then used on the new dataset with the help of the Score node to detect which customers
are on the verge of churning ('Enforce').
Even though the model flags customers having a high probability of churn, the case study doesn't
really follow 'Improve', as the model doesn't enhance the core product offering but just indicates
which customers may be unhappy with the services. Similarly, the case study doesn't follow
'Satisfy', as it cannot suggest tailored solutions to individual customers but can only suggest
which customers should be offered a discount to retain them.
SEMMA
SEMMA (Sample, Explore, Modify, Model and Assess) is a methodology formulated by the SAS Institute
to conduct data mining tasks on its software, SAS Enterprise Miner. SEMMA is concerned with
the model development aspects of data mining in SAS Miner, and adherence to it ensures end-to-end
coverage of the core data mining processes, which directly leads to more informed and accurate
analysis.
However, due to the lack of concrete approaches towards the data mining process flow (other than
CRISP-DM), SEMMA is followed by many analysts to conduct data mining activities. SEMMA stands for -
Sample: Every data mining activity should start with sampling of the dataset into training,
validation and test sets, ensuring there's enough information to carry out all these tasks.
Explore: In this stage, we investigate and explore the variables to discover information and
patterns that may exist between the variables.
Modify: At this stage, we select appropriate methods to modify, transform and rectify variables that
would be used in the modelling.
Model: After exploration and modification of variables, we apply the modelling technique on the
selected variables.
Assess: At the last phase, we evaluate the accuracy and predictive capabilities of the models.
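The Sample step can be sketched in Python as a simple random partition. This is an illustration rather than the Data Partition node's own logic: the function name `partition`, the 40/30/30 split and the fixed seed are all assumptions.

```python
import random

def partition(rows, train_pct=40, validation_pct=30, seed=42):
    """Shuffle rows and split them into training, validation and test lists.

    The test set receives whatever remains after the first two shares.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = len(shuffled) * train_pct // 100
    n_validation = len(shuffled) * validation_pct // 100
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return training, validation, test

# Hypothetical customer IDs standing in for dataset rows.
customer_ids = list(range(100))
training, validation, test = partition(customer_ids)

print(len(training), len(validation), len(test))
```

Every row lands in exactly one of the three sets, so the validation and test sets contain no rows that were also used to fit the model.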
Relation to Churn Case Study
SAS proposed that SEMMA is the core process of conducting a data mining activity. As can be
observed from Figure 10, the Churn case study closely followed the SEMMA principles. The churn
analysis commences with the Data Partition node (Sample), which enables us to create samples from
the dataset and allocate sufficient data for training, validation and test. This is then followed by
imputation of missing values and replacement of a variable to reduce its number of classes (Modify).
To reduce redundancy, we utilize the Variable Clustering node (Explore) and then run our
Decision Tree and Regression models (Model). Through Model Comparison (Assess), we compare the
two predictive models and find something peculiar. To investigate it further, we use the MultiPlot
node (Explore) and detect abnormal variables which affected the predictive capabilities of the
model.

Figure 10: Process flow of Churn Case Study
Using the Metadata node (Modify), we remove these abnormal variables and re-connect the Decision
Tree and Regression models (Model) to it. Then, we again use the Model Comparison node (Assess) to
gauge which model outperforms the other. Finally, we use the Score node (Assess) to apply the best
model to the new dataset and complete the data mining process.
Thus, it could be concluded that all the steps of SEMMA were comprehensively covered by the
Churn case study.
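The Assess step of comparing two models can be sketched in Python as follows. The churn flags and predictions are made up for illustration, and validation misclassification rate is an assumed selection criterion; the report does not state which statistic the Model Comparison node used.

```python
def misclassification_rate(actual, predicted):
    """Share of rows where the predicted class differs from the actual class."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

# Hypothetical validation-set churn flags and the two models' predictions.
actual = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
predictions = {
    "Decision Tree": [1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
    "Regression": [1, 0, 1, 0, 0, 0, 0, 1, 1, 0],
}

rates = {name: misclassification_rate(actual, predicted)
         for name, predicted in predictions.items()}

# The model with the lowest validation error is carried forward to scoring.
best_model = min(rates, key=rates.get)

for name, rate in rates.items():
    print(f"{name}: {rate:.0%} misclassified")
print(f"selected for scoring: {best_model}")
```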
The adherence to SEMMA in the Churn case study can be summarized as:
Steps      Nodes
Sample     Data Partition
Explore    MultiPlot, Variable Clustering
Modify     Impute, Replacement, Metadata
Model      Decision Tree, Regression
Assess     Model Comparison, Score
Relating SEMMA with Churn Case Study.
Importance of SEMMA
Even though SAS insists SEMMA is merely a set of guidelines to be followed for SAS Miner, the
methodology's application can be extended to data mining tasks as a whole. SEMMA is a very robust
approach that encompasses all the chief criteria required for undertaking or building a complex
predictive model. Adherence to SEMMA ensures ease of process flow, detection of faults and
creation of more accurate models.
The Churn case study benefitted hugely from following the SEMMA methodology. Through MultiPlot and
Variable Clustering we could 'explore' erroneous and redundant variables, and through
Impute, Replacement and Metadata we could 'modify' such variables. Model Comparison enabled us
to compare, contrast and 'assess' the two predictive 'models': Decision Tree and Regression. With
the help of the Score node, we evaluated and applied the model to a new dataset.
Appendix
Figure 11: Most significant variables for each of the four clusters.
