Silvia Trif, Adrian Visoiu / International Journal of Engineering Research and Applications(IJERA) ISSN: 2248-9622 www.ije...
Silvia Trif, Adrian Visoiu / International Journal of Engineering Research and Applications(IJERA) ISSN: 2248-9622 www.ije...
Silvia Trif, Adrian Visoiu / International Journal of Engineering Research and Applications(IJERA) ISSN: 2248-9622 www.ije...
Silvia Trif, Adrian Visoiu / International Journal of Engineering Research and Applications(IJERA) ISSN: 2248-9622 www.ije...
Silvia Trif, Adrian Visoiu / International Journal of Engineering Research and Applications(IJERA) ISSN: 2248-9622 www.ije...
Upcoming SlideShare
Loading in …5



Published on

IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit

Published in: Technology, Education
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide


  1. 1. Silvia Trif, Adrian Visoiu / International Journal of Engineering Research and Applications(IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.1085-10891085 | P a g eGene Expression Programming Based Dataset Decoration forImproved Churn PredictionSilvia Trif*, Adrian Visoiu***(Department of Doctoral Studies of Academy of Economic Studies, Bucharest, Romania)** (Department of Economic Informatics and Cybernetics, Academy of Economic Studies, Bucharest, Romania)ABSTRACTMobile network operators rely on businessintelligence tools to derive valuable informationregarding their subscribers. A key objective is toreduce churn rate among subscribers. The mobileoperator needs to know in advance whichsubscribers are at risk of becoming churners. Thisproblem is solved with classification algorithmshaving as input data derived from the largevolumes of usage details recorded. For certaincategories of subscribers, available data is limitedto call details records. Using this primary data, adataset is created to be conveniently used by aclassification algorithm. Classification qualityusing this initial dataset is improved by a proposedmethod for dataset decoration. Additionalattributes are derived from the initial datasetthrough generation, based on gene expressionprogramming. Classification results obtained usingthe decorated dataset show that the derivedattributes are relevant for the studied problem.Keywords - business intelligence, churn prediction,classification, data mining, gene expressionprogrammingI. INTRODUCTIONNowadays, due to competition, one majorchallenge for service suppliers is to keep theirexisting position on market, if not gaining moremarket share. For the supplier, this creates thenecessity to find ways to better address customerneeds. The supplier has to understand the customer inorder to model its behavior. Since quantitativeinformation is objective and may be subject to furtherprocessing, any available customer data is collectedfor further analysis. This is also supported by theexistence of information systems and data inelectronic format. Datasets are created and fed tobusiness intelligence specialized tools in order toobtain relevant information to support the decisionmaking process. This processing flow enables a rangeof applications such as: churn prediction, suggestionengine, customer segmentation. Business Intelligencetools are not standalone, but integrated with theinformation system of the organization, such asCRM, as shown in [1]. Some applications of theintelligent engine consist in using BusinessIntelligence functionality directly from anothersystem, without human intervention, like the platformfor secure USSD service presented in [2]. Theinteractions between Business Intelligence tools andother systems may become complex. A referencecomplex churn prediction system is presented in [3].II. PROBLEM FORMULATIONOne of the problems service providers faceis related to preventing churning. Churning occurswhen a particular subscriber stops using the servicesprovided by the supplier. The reasons for churningare various and the service provider may not havesufficient data to infer them. However, statisticalanalysis is able to discover patterns associated tochurning behavior. When performing the analysis,past data is analyzed and patterns or rules are derivedto identify churners. Statistical methods are used toestimate the churn risk for each subscriber aspresented in [4]. When using in production, forexisting subscribers, if they match churning patternsor rules, they are considered as potential churners andthe service provider is able in advance to takepreventive measures to stimulate the subscriber.Preventive measures include bonuses, discounts, giftsand other incentives that are likely to convince thesubscriber not to churn and continue the relationshipwith the service provider [5].In the following, we consider the case of atelecom operator as a service provider. Telecomoperators have: a large number of subscribers of twomain types – prepaid and postpaid - and a largenumber of records regarding service usage, stored infiles or databases. An important step in the creationof the dataset is gathering and transforming primarydata to the format needed by the algorithm used inanalysis. Standard approaches to data mining includethe usage of data warehouses as presented in [6].Churn analysis results quality is influenced by certainfactors: available data in terms of volume and detailsor attributes and algorithms used.Regarding classification algorithms, thereference algorithm used in analysis is Naïve Bayes.
  2. 2. Silvia Trif, Adrian Visoiu / International Journal of Engineering Research and Applications(IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.1085-10891086 | P a g eSeveral other algorithms fit to solve this problem aredecision trees and logistic regression as presented in[7].Regarding available data at telecomoperators, for both prepaid subscribers there are calldetail records showing when and how the serviceswere consumed. There are also differences betweenprepaid and postpaid subscribers regardingsupplemental information regarding the subscriber.Postpaid subscribers consume their service based ona sign contract and therefore the operator has accessto identification data such as name, gender, age andsimilar. This kind of information is important inchurn analysis as it represents additional attributeshelping the algorithm to give better results. On theother hand, prepaid subscribers are anonymous forthe service provider. The only identification is thephone number – MSISDN - they have on the prepaidSIM card. Therefore, churn prediction is a lot moredifficult for prepaid subscribers for which onlyservice consumption is known. Based on the serviceconsumption, for each MSISDN, service usage issummed up for a number of equidistant andconsecutive periods or time in order to build theinitial dataset. A record has the form of MSISDNi,USGi1, USGi2, …, USGij, …, USGik, STATE,where:- MSISDNi is the identifier of the prepaidsubscriber in the system- USGij is the amount of service consumed bysubscriber i during period j- STATE is a Boolean value showing if thesubscriber was a churner, based on the amountconsumed by the subscriber i during the k+1period, if USGik+1>0 then the subscriber is“active”, if USGik+1=0 then the subscriber is“inactive”The set of records defined above constitutesthe initial dataset. In this context, the churnprediction analysis aims to estimate the value ofSTATE variable depending on USG1, USG2, ..,USGk. This is treated as a classification problem andclassification algorithms are applied. However, it isobserved that simply employing the initial dataset foranalysis leads to poor classification results as shownin [8]. Dataset decoration is needed to add relevantattributes that would improve the quality of theresults. In case of poor information available,decoration is done using derived attributes that areobtained by transforming the initial attributes suchway to:- Eliminate unnecessary correlations betweeninitial attributes- Improve the dependency between thederived attributes and the studied variable- Create meaningful derived attributes, easy tocompute and to interpret- Gain more information from the derivedattributes than from initial attributesIn [8] a transformation for creating derived attributesbased on predefined basic statistical indicators ispresented. In [9] more attributes are derived byanalyzing the call graph. In our research weconsidered only the above defined consumptionindicators.In the following we investigate a free transformationof the above initial dataset based on expressiongeneration using a dedicated evolutionary algorithm,the gene expression programming.III. GENE EXPRESSIONPROGRAMMING BASED DATASETDECORATIONIn the following, the simple transformationsmade on the initial attributes are extended to a moregeneral approach. Using simple operators likeaddition, subtraction, multiplication, division as wellas operands taken from the attribute set, then a largenumber of expressions may be obtained. As thenumber of expression forming combinations is verylarge, and generating all the possible combinations istime consuming, a solution taken into account toobtain a valid result in a short time is to use geneexpression programming as a mean of generatingderived attributes based on the set of initial attributesand a set of operators to be applied.Gene expression programming is anevolutionary algorithm based on the principlesdescribed in [10]. A population of chromosomesevolves through applying genetic operatorsiteratively. Genetic operators include: elitism –survival of the best chromosomes from a generation,crossover – exchange of genetic material betweentwo chromosomes, mutation – random change insideexisting chromosomes. Chromosomes encode,symbolically, individuals representing solutions of aproblem to be solved. The design and preparation ofthe genetic algorithm in general, and of the geneexpression programming, in particular, requires thatspecific concepts from the evolutionary algorithmneed to be put in correspondence with elements of theproblem to be solved.Structurally, chromosomes specific to geneexpression programming are made up of one or moregenes. Genes are made up of symbols. Chromosomesencode expressions made up of operands andoperators. Due to flexibility needs, chromosomesallow the encoded expression to be obtained bycombining several sub-expressions encoded bygenes, such way each gene encodes a sub-expression.
  3. 3. Silvia Trif, Adrian Visoiu / International Journal of Engineering Research and Applications(IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.1085-10891087 | P a g eA symbol encodes an operand or an operator. Sincethe expressions encoded by genes are intended to beevaluated, there are specific genetic expressionprogramming rules to correctly form a gene. A geneis made up of a head having symbols encoding bothoperators and operands, and a tail which is made uponly of operands. The length of the head and thelength tail take into account the maximum number ofoperands needed by the operators taken into account,such way the syntax tree associated to the expressionto be well formed, as presented in [11].In our research, we aim to obtain a datasethaving attributes derived from initial attributes, underthe form of analytical expressions made up ofoperands and operators. Taking into account thestructure of the gene expression programmingchromosome, we propose the following mappingbetween gene expression programming concepts andour problem. Each chromosome encodes the list ofderived attributes. Each gene of the chromosomeencodes the expression of the derived attribute. Eachsymbol encodes an operator from a predefined set oran operand taken from the set of initial attributes. Inthe following, we consider the initial datasetcontaining only usage attributes: USG1, …, USG4..Table 1 presents a gene encoding for the expression,while Fig. 1 shows the actual syntax tree inferredfrom the encoding, in a top-down, left-to-rightfashion.E= (USG3-USG2)/USG1Gene:Table 1 A gene encodingFig. 1 Syntax tree for the geneFor example, a two-gene chromosome, with geneshaving a head length of 2 positions and a tail lengthof 3 positions is presented in Fig. 2:Chromosome:Fig. 2 Two gene chromosomeThe preparation of the gene expression programmingapproach requires:1) Setting up the list of operators, OT, and thelist of operands, OD. For example, the list ofoperators, OT is OD={+,-,*,/} and the list ofoperands, OD is OD={USG1,USG2, USG3, USG4}Crossover operation that occurs between tworandomly chosen chromosomes enables the exchangeof genetic material in order to obtain new individualswith new characteristics. Crossover occurs at severalpositions inside the chromosome. The crossoverbetween two single gene chromosomes at a certainposition is presented in figure below:Chromosome 1, encoding the expression (USG2-USG3)+USG1 is presented in Fig. 3:Fig. 3 Chromosome1before crossoverChromosome 2, encoding the expression(USG3+USG4)-USG2 is presented in Fig.4:Fig. 4 Chromosome2 before crossoverAfter crossover at position 3, chromosome 1’,encoding the expression: (USG3-USG4)+USG1 ispresented in Fig. 5:Fig. 5 Chromosome1’ after the crossoverChromosome 2’, encoding the following expression(USG2+USG3)-USG2, is presented in Fig. 6:Fig. 6 Chromosome 2’ after the crossoverThe syntax trees associated to the initialschromosomes before and after crossover arepresented in Fig. 7 and Fig. 8, respectively:Fig. 7 Expression trees before crossover,corresponding to chromosomes 1 and 2, respectively.
  4. 4. Silvia Trif, Adrian Visoiu / International Journal of Engineering Research and Applications(IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.1085-10891088 | P a g eFig. 8 Expression trees after crossover, correspondingto the chromosomes 1’ and 2’. respectively.This kind of crossover may happen in severalpositions, affecting gene contents in both single-geneand multiple-gene chromosomes. This enables theevolution of expressions of the derived attributesthrough the implied changes.A special case of crossover is when such exchange ofgenetic material occurs only at positions pointing tothe beginning of genes, inside a multiple-genechromosome. In this case, entire attributes encodedby genes will be exchanged, as shown in Fig. 9 andFig. 10, respectively.Fig. 9 Multi-gene chromosomes before crossover atgene positionAfter crossover at position 2 the new chromosomesare the ones presented in Fig. 10:Fig. 10 Multi-gene chromosomes after crossover atgene positionEntire gene crossover enables the evolution ofderived attributes lists, through the change theyimply. Both of these two kinds of the same geneticoperator may be applied to the chromosomepopulation independently.Mutation is a sudden change in the content of a gene,randomly changing a symbol. For example, Fig. 11shows a single gene chromosome before the mutationof an operator that changes the encoded expression.Chromosome 1: (USG2-USG3)+USG1Fig. 11 Chromosome 1 before mutationAfter mutation at position 1, changing minus signwith plus, the chromosome becomes as presented inFig.12:Chromosome 1’: (USG2+USG3)+USG1Fig. 12 Chromosome 1 after mutationUsing the proposed gene expression programmingapproach, an initial population of multi-genechromosomes has been evolved in order to obtain alist of derived attributes corresponding to the bestchromosome after a threshold number of iterations.The performance criteria was the quality of theclassification with respect to each attribute encodedby a chromosome.The list of derived attributes obtained using geneexpression programming is shown in table 2.Table 2 The list of attributes derived using geneexpression programmingClassifying the subscribers using the whole deriveddataset, the percent of correctly classified instances is58.56% while the incorrectly classified instances are41.13% of the set, as obtained from table 3.Table 3 Correctly and incorrectly classifiedsubscribers using the list of derived attributesThis is an improvement over the results obtainedusing the initial usage attributes. Furtherimprovement is made by generating combinations ofattributes and asses the quality of the classificationfor each individual combination. The combination ofattributes having the best classification quality ischosen. Table 4 shows the derived attributes chosenas best combination in our case study.Table 4 The combination of derived attributes that isbest for the classificationUsing the dataset defined by the attributes from table10, the correctly classified instances are 71.50% ofthe total, while incorrectly classified instances areonly 28.50%. The classification details are presentedin table 5.
  5. 5. Silvia Trif, Adrian Visoiu / International Journal of Engineering Research and Applications(IJERA) ISSN: 2248-9622 www.ijera.comVol. 3, Issue 3, May-Jun 2013, pp.1085-10891089 | P a g eTable 5 Correctly and incorrectly classifiedsubscribers using the best derived attributesIt is also important that the chosen derived attributeshave meaningful expressions with respect to thedomain:- (USG3+USG4)/USG1 shows the amount ofusage for the last two weeks over the usage duringinitial week of the study; this is a relative indicatorshowing how many times the usage was higher withrespect to the baseline- (USG2+USG3)/USG1 is also a relativeindicator showing how many times the usage washigher in a past period with respect to the baseline- USG4-USG3 shows the absolute differencein consumption during consecutive periods.Therefore, the generated attributes have statisticalmeaning, being easy to understand and interpret.They are also selected from a large set of generatedexpressions otherwise hard to assess without theconvergence of evolutionary approach.As seen, gene expression programming is a powerfulmethod to use in order to obtain derived attributesfrom the original dataset.IV. CONCLUSIONDataset decoration is a useful technique forimproving the quality of classification when poordata is available. We have proposed a datasetdecoration method based on gene expressionprogramming. For churn prediction analysisattributes derived using the proposed method arerelevant and their use for classification bringssignificant improvement in the results of theclassification.ACKNOWLEDGEMENTSThis work was co-financed from theEuropean Social Fund through Sectoral OperationalProgramme Human Resources Development 2007-2013, project number POSDRU/107/1.5/S/77213„Ph.D. for a career in interdisciplinary economicresearch at the European standards”.REFERENCES[1] A. Tudor, A. Bara, I. Botha, Data MiningAlgorithms and Techniques Research inCRM Systems, Proc. 13th WSEASInternational Conference on MathematicalMethods, Computational and IntelligentSystems, Iasi, Romania, 1-3 July 2011, 265-270.[2] S. Trif, A. Vișoiu, USSD based one timepassword service , Proc. 5th InternationalConference on Security for InformationTechnology and Communications 2012,Bucharest, 2012, 141 -149.[3] S. Yuan-Hung, David C. Yen, Hsiu-YuWang, Applying Data Mining to TelecomChurn Management, Expert Systems withApplications, 31(3), 2006, 512-524.[4] R. J. Jadhav, Churn Prediction inTelecommunication Using Data MiningTechnology, International Journal ofAdvanced Computer Science andApplications, 2(2), 2011, 87-110.[5] J. Sathyan, M. Sadasivan, Next GenerationMobile Care Solution , Proc 10th WSEASInternational Conference onTelecommunications and Informatics,Lanzarote, Canary Islands, Spain, May 27-29 2011, 265-270[6] D. Camilovic, D. Becejski-Vujaklija, N.Gospic, A Call Detail Records Data Mart:Data Modeling and OLAP Analysis,Computer Science and Information Systems,6(2), 2009, 17-19.[7] Li-Shang Yangi , C. Chiu, KnowledgeDiscovery on Customer Churn Prediction”,Proc. 10th WSEAS Interbational Conferenceon Applied Mathematics, Dallas, Texas,USA, November 1-3, 2006, 523-528.[8] S. Trif, A. Visoiu, Improving ChurnPrediction in Telecom through DatasetDecoration , Proceding 7th WSEASInternational Conference on ComputerEngineering and Applications (CEA 13),9-11 Ianuarie 2013 Milano, RecentResearches in Information Science andApplications, Recent Advances in ComputerEngineering Series, Vol. 9 , WSEAS Press,223 – 228[9] K. Dasgupta, R. Singh, et al., Social Tiesand their Relevance to Churn in MobileTelecom Networks, Proc. 11th internationalconference on Extending databasetechnology: Advances in databasetechnology, 2008, 668-677.[10] B. C. Ferreira, Gene ExpressionProgramming: Mathematical Modeling byan Artificial Intelligence 2nd Edition(Springer Publishing, May 2006).A. Visoiu, Deriving Trading Rules Using GeneExpression Programming, Informatica Economica,15(1), 2011, 22-30, ISSN 1453-1305