COMP 7570 – Neural Networks Project Report
Neural Network Classification and its Applications in Insurance Industry
Inderjeet Singh
7667292
Department of Computer Science
University of Manitoba
December 8, 2011
Abstract
Neural networks used for classification, also known as neural classifiers, have many advantages. Extracting rules from these trained networks, however, is a hard task, and research has been done in this regard. [Lu] proposed a method of extracting rules from neural networks and advocated the use of neural networks in classification and in data mining in general. [Smith] carried out a case study on the use of neural networks for customer retention in the insurance industry. They discussed the importance of predicting patterns of customer terminations for gaining profit in this highly competitive industry. [Viaene] deployed neural networks for predicting claim fraud in the automobile insurance industry. The relevance of the inputs (fraud indicators) is important for detecting claim fraud. They used neural networks (MLP-ARD) to produce importance rankings of the fraud indicators for the automobile insurance industry.

1. Introduction
Neural networks [Scuse] are models of intelligence that consist of large numbers of simple processing units, also known as neurons or nodes, that collectively are able to perform very complex pattern matching tasks. These models perform stimulus-response (input-output) mapping. Classification, which is a branch of data mining [Wiki], is the process of learning rules or models from training data to generalize the known structure and then classifying new data with these rules.
Normally in the data mining field, classification is done with decision tree algorithms and logistic regression. These days neural networks are also used as one of the approaches for classification. Classification with neural networks is a popular area of research. It has gained a lot of attention, specifically in the field of data mining, where the volume of data is too large to handle.

Neural networks used for classification have many advantages. They are data driven and self-adaptive. They can approximate complex functions with high accuracy. They can be used to build non-linear models of real-world applications. They are also tolerant to noisy data. Neural classifiers have problems as well. They usually lack transparency and behave like a black box; their training time is long because training requires many repeated cycles (epochs) over the training data. Also, extracting classification rules [Lu] from neural networks is difficult because of their complex and incomprehensible structure, with too many links between input, hidden and output units.

Neural networks have already been used in real-world applications such as bankruptcy prediction, credit scoring, quality control, the insurance industry, handwriting recognition and many more. In this report I will focus specifically on their application in the insurance industry.

The insurance industry is very competitive. The success of an insurance company depends upon profit and growth. Profit depends upon various factors. Predicting the average claim cost and the frequency of claims, and examining the effect of changes in policy prices or premium costs on customer retention [Smith], is critical for profit. Neural classification has been applied in this regard to learn and predict whether a customer will terminate or renew a policy. Claim fraud is another important issue in this industry. Companies are facing huge losses of money from fraudulent claims made by the insured. They are looking for solutions for fraudulent claim prediction and diagnosis. Neural classification [Viaene] helps to determine which fraud indicators or inputs are most crucial for predicting fraudulent claims. Both of the above applications used different versions of multilayer feed-forward neural networks.

2. Extracting Symbolic Classification Rules from Neural Networks
In this work, [Lu] focus on mining classification rules from large databases with the help of neural networks. The neural network approach has advantages such as a low classification error rate and robustness to noise.

The neural network based classification approach they describe consists of three phases.

The first phase is network construction, in which a three-layer feed-forward neural network is constructed. The method for creating the network is inspired by the [Setiono 1995] method of dynamically creating the network. Network creation starts from a single hidden unit and then dynamically adds hidden units until the network classifies all the input patterns correctly. Rather than minimizing the sum of squared errors, [Setiono 1995] maximizes the likelihood function. Also, unlike the back propagation method, this method does not get stuck in local minima.

The second phase is network pruning. A penalty function [Setiono] is added to the error function, which helps to prune the network by weight removal. The penalty function used in the above approach is the sum of squared weights. While pruning the network, the classification error rate should not increase. The first objective of pruning is to discourage nonessential connections, and the second is to prevent the connection weights from attaining large values. Removing unnecessary weights reduces the network's complexity.

The last phase is rule extraction from the pruned network. Extracting rules is not easy, as the number of links in the pruned network is still too large to express the relationship explicitly in terms of if-then-else rules. Also, it is difficult to derive a clear relationship between the continuous activation values of the hidden units and the output units. Rule extraction from a pruned network consists of four steps:
1. Use a clustering algorithm to find clusters of the hidden units' activation values.
2. Enumerate the discretized hidden unit activation values and compute the outputs. Generate the rules that describe the network output in terms of the hidden unit activation values.
3. For every hidden unit, enumerate the input values that lead to its activation values and generate the set of rules that describes the hidden unit activation values in terms of the input units.
4. Merge the rules obtained in the previous two steps to obtain rules that map inputs to output.

They explained their approach of rule extraction on one of the 10 classification problems or functions used earlier in research. They chose function 3 to demonstrate their approach. Function 3 is shown in Figure 1.
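The first two rule extraction steps can be illustrated with a minimal Python sketch. It is not the algorithm of [Lu]: the split points, cluster centres, output weights and bias below are hypothetical values chosen only so that the example runs, and a single logistic output unit is assumed.

```python
import math
from itertools import product


def discretize(value, split_points):
    """Step 1 (simplified): map a continuous activation to a 1-based
    cluster index, given the split points between clusters."""
    return 1 + sum(value >= s for s in split_points)


def output_rules(cluster_centres, out_weights, out_bias, threshold=0.5):
    """Step 2: enumerate every combination of discretized hidden unit
    values and record the class the output unit assigns to it.

    cluster_centres[j] lists the representative activation of each
    cluster of hidden unit j (hypothetical values in this sketch).
    """
    rules = {}
    index_ranges = [range(len(centres)) for centres in cluster_centres]
    for combo in product(*index_ranges):
        activations = [cluster_centres[j][k] for j, k in enumerate(combo)]
        net = out_bias + sum(w * a for w, a in zip(out_weights, activations))
        y = 1.0 / (1.0 + math.exp(-net))  # logistic output unit (assumed)
        key = tuple(k + 1 for k in combo)  # 1-based cluster indices
        rules[key] = "A" if y >= threshold else "B"
    return rules


# Hypothetical two-hidden-unit network, two clusters per hidden unit.
centres = [[-0.5, 0.8], [-0.5, 0.9]]
rules = output_rules(centres, out_weights=[-4.0, -4.0], out_bias=3.0)
for hidden_values, group in sorted(rules.items()):
    print(hidden_values, "->", group)
```

With these made-up weights, a pattern falls in group B only when both hidden units take their second cluster value. Steps 3 and 4 would then replace each hidden unit condition with the input conditions that produce it.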
Figure 1: Function 3 [Lu]

To solve the classification problem represented by function 3, they created the neural network as described in the network construction phase above. They used the people database, consisting of nine inputs (salary, commission, age, elevel, car, zip code, house-value, house-years and loan) and one output representing the class. An input tuple can belong to group A or group B. The inputs were represented as binary strings of 0s and 1s. The respective bits of the input string are 0 or 1 depending upon which subinterval the value of the input is located in. With the above binary scheme there were a total of 37 binary input units (shown in Fig 11) for the values of the 9 inputs, plus one input unit for bias, making a total of 38 units. The non-pruned network consisted of six hidden units and one output unit. Therefore, it consists of 234 links (38 × 6 input-to-hidden links plus 6 hidden-to-output links). The training dataset they used has 2000 tuples of these inputs. Network pruning is performed as described above, giving a much simpler network, as shown in Fig 2. This pruned network consists of only two hidden units and six input units.
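The binary input coding and the link count can be checked with a short Python sketch. The exact coding of Fig 11 is not reproduced in this report, so the thermometer-style scheme and the age cut points used here are assumptions for illustration only.

```python
def thermometer_encode(value, cut_points):
    """Encode a continuous value as one bit per cut point.

    A bit is 1 when the value lies at or above that cut point, so the
    bit pattern identifies the subinterval containing the value.
    (Assumed coding; the cut points are hypothetical.)
    """
    return [1 if value >= c else 0 for c in cut_points]


def count_links(n_inputs, n_hidden, n_outputs):
    """Links in a fully connected three-layer feed-forward network."""
    return n_inputs * n_hidden + n_hidden * n_outputs


# Hypothetical cut points for the age attribute.
age_bits = thermometer_encode(45, cut_points=[20, 40, 60])
print(age_bits)

# 37 binary inputs + 1 bias unit, 6 hidden units, 1 output unit.
print(count_links(n_inputs=38, n_hidden=6, n_outputs=1))
```

The second call reproduces the 234 links stated for the non-pruned network.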
Figure 11: Coding of the attributes of the neural network inputs [Lu]

Before extracting the rules from the pruned network shown in Fig 2, all four steps described above are executed. The activation values of its two hidden units are clustered. The clusters are centered on 0.46 and 0.81. This results in two clusters of discretized activation values. For the first hidden unit, input tuples are split into two groups, one with activation values in [-1, 0.46) and the other with values in [0.46, 1). For the second hidden unit, input tuples are split in the same way into groups of [-1, 0.81) and [0.81, 1). The discretized activation value of a pattern on hidden unit j (j = 1 or 2) is represented by the value 1 or 2. A value of 1 on a hidden unit means that the input tuple belongs to group A, and a value of 2 means that the input tuple belongs to group B. For an input to be classified in group A, the discretized value of either hidden unit should equal 1; otherwise the input is classified in group B.

To generate rules for each hidden unit that do not involve weights, [Lu] used the X2R algorithm they developed earlier. The rules obtained for the two hidden units are combined to give the rules for the final output in terms of the input units. For function 3, they extracted a total of 5 rules with a total of 10 conditions from the pruned network. The rules can then be expressed in terms of the actual input attributes age and elevel for function 3 (the rule listing is not reproduced here), ending with the default rule: Else Group B.

For evaluation and analysis, they compared their approach of extracting rules from neural networks with the decision tree classifier (C4.5) approach. The neural network classification was tested on eight functions similar to function 3 described above. Random number generation was used to create the datasets for testing the rules generated for the different functions or classification problems. They used three-fold cross validation to estimate the classification accuracy of the generated rules. Fig 3 shows the results of evaluating the quality of the rules generated by neural networks for the different functions. They found that neural classifiers generate far fewer rules than the decision tree algorithm C4.5, as shown in Fig 4. The accuracy and the number of conditions per rule for the different functions were comparable for both approaches.

They concluded that efforts can be made to make the training of neural classifiers faster. In this regard, they suggested incremental training and rule extraction from the database.
Figure 2: Pruned network for Function 3 [Lu]
Figure 3: Averages of accuracy rates, the number of rules and the average conditions per rule obtained [Lu]
Figure 4: The number of rules extracted from neural networks (NN) and C4.5 algorithm (DT) [Lu]

3. Neural Network Applications in Insurance Industry

3.1. An Analysis and Prediction of Customer Retention Patterns and Pricing
A central problem in the insurance industry is setting prices that match the claim costs while retaining existing customers and acquiring new ones. There has been a lot of research in this regard, but due to the competitiveness of this industry hardly any results or methods for solving this problem get published.
In this case study, [Smith] work on the structured problem of customer retention modelling using regression, decision trees and neural networks, which are supervised learning methods. The methods are used to learn the relationships between variables (inputs) and decisions (outputs). They also study the unstructured problem of analysing claim patterns using clustering, which is an unsupervised learning method. In this report, I will discuss the first problem, customer retention using neural classification, which is the main focus of this project.

Growth of an insurance company depends upon attracting new customers and retaining the existing ones. The renewal or termination of a policy by a customer depends upon premium price, service, personal preference, insured amount, convenience and many other factors. The analysis of customer retention in this case study involves two goals: first, to understand the reasons for policy termination, and second, to develop a tool (based on a neural classifier) for predicting likely policy terminations. This tool will help in analyzing the impact of changes in premium costs on likely terminations. Identifying the customers who are likely to terminate their policies can aid direct marketing campaigns.

To analyze the customer retention patterns, [Smith] obtained data on 20914 auto policy holders whose policies were due to expire in April 1998. The dataset included demographic information (age group, postcode, etc.), policy details (premium, sum insured, etc.) and policy holder history (rating, years on rating, claim history, etc.), as shown in Fig 5. In this dataset, 7.1% of policy holders did not renew their policies, and their policies terminated. Through meetings with the insurance company, [Smith] found that premium price and sum insured were major factors in likely policy terminations.
They used the SAS Enterprise Miner software for evaluation. SAS Enterprise Miner is a widely known GUI-based commercial software package for applying data mining techniques. The setup for this particular experiment involves three levels. The first level is data processing (variable selection, data transformation and data partitioning), the second level is the application of data mining techniques (clustering, regression, decision trees and neural networks) and the last level is the analysis (assessment, bar charts). The process flow diagram is shown in Fig 6. In data transformation they normalized and log-transformed the variables. After the transformation, they had a total of 29 independent inputs and one output (the dependent variable, a yes-or-no termination decision), shown in Fig 5.

Regression, decision tree and neural network methods (available in the SAS software) were used to build three separate classification models, or classifiers. These classifiers predict the likely terminations or renewals of policies. A three-layer feed-forward neural network with 29 input units, 25 hidden units and a single output unit is used. The units used the hyperbolic tangent activation function. The default learning rule, which uses a multiple Bernoulli error function, is used. The error is minimized by adjusting the weights with a conjugate gradient technique.

All three methods are executed on the test set to classify the likely terminating policies. The test set consists of 20% of the entire dataset and is ranked in descending order of the likelihood of policy holders terminating their policy. Fig 7 shows the lift chart comparing the performance of all three methods in classifying the policy holders as terminating. A lift chart measures the effectiveness of a predictive model, and the area under the lift curve indicates how accurate the predictive model is. The X-axis of the chart depicts the percentage of policy holders selected from the ranked test set, and the Y-axis depicts the percentage of likely terminating customers among the policy holders selected. As can be seen in Fig 7, the white line, the lift curve representing neural networks, has the largest area, which means it classifies most of the terminating policies. If only 10% of the policy holders are selected, ranked in order of likely termination as predicted by the neural network model, 50% of the predicted terminations are correct. With regression and decision trees this accuracy is only 40% and 28% respectively.

The effect of the decision threshold on the number of policies classified as terminated by the network is also determined. If the decision threshold is set to 0.5, a policy is classified as terminated if the probability of termination predicted by the neural network is above 0.5. It is observed that setting a low value of 0.1 for the decision threshold helps in predicting all likely terminations. Marketing mail can be sent to these likely terminating customers to encourage them to renew their policies. But a low decision threshold results in a loss of accuracy in predicting terminations. It is good to keep the decision threshold high (high accuracy) if the premiums are being changed for policy holders who are predicted as most likely to terminate their policies. This ensures that premium changes are made only for likely terminating customers.

Misclassification costs can be specified to generate a profit-loss matrix. For example, if a policy holder is classified as a likely termination but renews the policy, the misclassification cost is the discount offered as an incentive to renew. On the other hand, if the customer is not predicted as a termination but actually terminates the policy, the misclassification cost is the loss of the premium for the next year. The optimal value of the decision threshold needs to be determined to minimize misclassification costs and maximize profits.

Pricing the policies is the tricky part. The pricing of policies occurs in four steps: prediction of claim costs, identification of the right premium price to gain profitability, analysis of the customer retention patterns considering the difference between old and new premiums, and finally adjustment of these premiums to retain customers while still making a profit. These four steps are executed every time before marketing mail is sent to the customers likely to terminate their policies. The new price of the policy might not suit the customer, who may then decide to terminate the policy. The data with the new policy price, together with the price difference, is fed into the neural network model to predict the likely terminations under the new prices. The prices can then be adjusted to balance the goals of profitability and customer retention. Optimal pricing is an iterative process with the goal of finding this balance.

Figure 5: Total of 29 input attributes [Smith]
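The trade-off between the decision threshold and the misclassification costs discussed above can be sketched in Python. This is only an illustration, not the procedure of [Smith]: the predicted probabilities, the actual outcomes and the two cost figures are hypothetical.

```python
def total_cost(probs, terminated, threshold, cost_false_alarm, cost_missed):
    """Sum the misclassification costs for one decision threshold.

    cost_false_alarm: cost when a renewing customer is flagged as a
                      likely termination (e.g. an unneeded discount).
    cost_missed:      cost when a terminating customer is not flagged
                      (e.g. next year's lost premium).
    """
    cost = 0.0
    for p, did_terminate in zip(probs, terminated):
        flagged = p > threshold
        if flagged and not did_terminate:
            cost += cost_false_alarm
        elif not flagged and did_terminate:
            cost += cost_missed
    return cost


def best_threshold(probs, terminated, thresholds, cost_false_alarm, cost_missed):
    """Return the candidate threshold with the lowest total cost."""
    return min(
        thresholds,
        key=lambda t: total_cost(probs, terminated, t,
                                 cost_false_alarm, cost_missed),
    )


# Hypothetical model outputs and outcomes for five policy holders.
probs = [0.9, 0.6, 0.3, 0.2, 0.05]
terminated = [True, True, False, True, False]

for t in (0.1, 0.5):
    print(t, total_cost(probs, terminated, t, 50.0, 500.0))
print(best_threshold(probs, terminated, [0.1, 0.5], 50.0, 500.0))
```

With these made-up costs a missed termination is ten times as expensive as an unneeded discount, so the lower threshold wins; a different cost ratio could reverse the choice.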
Figure 6: Process flow diagram for customer retention classification [Smith]
Figure 7: Lift chart showing percentage of policy holders classified for likely termination vs. percentage of policy holders selected from the test dataset. It shows the performance comparison for classification techniques such as regression, decision tree and neural networks [Smith]
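One point on a chart like Fig 7 could be computed as in the Python sketch below. It assumes one common reading of such a chart (a cumulative gains curve: the share of all actual terminations found in the top-ranked fraction of policy holders), which may differ from the exact definition used by [Smith]. The scores and outcomes are hypothetical.

```python
def cumulative_gain(scores, terminated, fraction):
    """Share of all actual terminations captured when the top `fraction`
    of policy holders, ranked by predicted score, is selected."""
    ranked = sorted(zip(scores, terminated),
                    key=lambda pair: pair[0], reverse=True)
    n_selected = max(1, round(fraction * len(ranked)))
    captured = sum(1 for _, t in ranked[:n_selected] if t)
    total = sum(1 for t in terminated if t)
    return captured / total if total else 0.0


# Hypothetical scores for ten policy holders, two of whom terminated.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
terminated = [True, False, True, False, False,
              False, False, False, False, False]

for fraction in (0.1, 0.3, 1.0):
    print(fraction, cumulative_gain(scores, terminated, fraction))
```

Plotting this value against the selected fraction for each model gives one curve per classifier, and a curve that rises faster indicates a better ranking of likely terminations.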
3.2. Auto Claim Fraud Detection using Bayesian Learning Neural Networks
Companies lose a large amount of money to fraudulent claims made by the insured. Insurance companies are looking for solutions for fraudulent claim prediction and diagnosis. These days they are using tools that rely on neural networks and artificial intelligence to solve this problem. Neural networks help in building general and scalable parameterized, non-linear mappings between inputs and outputs. But there are also some problems with them, such as which weights to set before training starts and how to avoid fitting the noise in the training data, which make them difficult to implement. These issues are mostly addressed in ad hoc ways.

In this paper, [Viaene] use Bayesian learning to deal with the above issues while training the neural networks. Bayesian learning learns the model in a step-by-step manner rather than ad hoc. [Viaene] explore the predictive power of Multi-Layer Perceptron (MLP) based neural network classifiers trained with the [MacKay] evidence framework approach to Bayesian learning, which is used to optimize an automatic relevance determination (ARD) objective function. The ARD objective function is useful in determining the relative importance of the inputs to the model. ARD and the evidence framework approach are described in more detail below.

They used the MLP back propagation neural network shown in Fig 8. The hidden nodes of the network have a hyperbolic tangent transfer function, and the output layer has a logistic sigmoid activation function. In Fig 8, x represents the input vector, z represents the output of the hidden units and y represents the final output. The continuous output y(x) of this MLP classifier can be interpreted as the posterior probability $p(t = 1 \mid x)$, which means the probability of getting class t = 1 as output, given the input vector x. The Bayesian posterior probability estimates produced by the MLP help classify the input vector into predefined classes by choosing a threshold in the scoring interval. While training the network, the weight vector w needs to be adjusted so that the objective function, which is the sum of squared errors, is minimized.

They measured the accuracy of prediction with two metrics: percentage correctly classified (PCC) and area under the receiver operating characteristic curve (AUROC).

Figure 8: Example of three layers Neural Network [Viaene]

While optimizing the neural classifier for the best generalization, it should be prevented from learning the noise in the training data, also known as overfitting. To avoid overfitting, usually a validation dataset is used. A better approach is to add a regularization or penalty term to the objective function. The unit based regularization term is also known as ARD. The final objective function now becomes $E_D + \sum_m \alpha_m E_{W_m}$, where $E_D$ is the error on the training data and $\alpha_m$ is the regularization parameter applied to the weight penalty $E_{W_m}$ of weight group $m$.
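A regularized objective of the kind described above can be sketched in Python as the data error plus one weighted penalty per group of weights. The grouping of weights (one group per input), the penalty form (half the sum of squared weights) and all numbers are assumptions made for illustration, not the exact formulation of [Viaene].

```python
def ard_objective(data_error, weight_groups, alphas):
    """Data error plus a separately weighted penalty per weight group.

    weight_groups[m] holds the weights leaving input m (assumed grouping);
    alphas[m] is the regularization parameter for that group.
    The penalty for each group is 0.5 * sum of squared weights (assumed).
    """
    penalty = sum(
        alpha * 0.5 * sum(w * w for w in group)
        for alpha, group in zip(alphas, weight_groups)
    )
    return data_error + penalty


# Hypothetical values: two inputs, the second judged less relevant.
weights_per_input = [[1.0, -1.0], [2.0]]
alphas = [0.5, 0.1]
print(ard_objective(2.0, weights_per_input, alphas))
```

Because each group has its own regularization parameter, a large value for one input pushes only that input's weights toward zero, which is how the relevance of individual inputs can be expressed.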
They discussed how critical input selection is to the overall classification process. The regularization parameter $\alpha$ in the ARD objective function helps suppress the weights leaving the inputs. The larger $\alpha$ is for an input, the more irrelevant that input is, and vice versa. The regularization parameters allow MLP-ARD to include a large number of potentially relevant input variables, thus eliminating the effort needed to delete irrelevant input variables. This also means adjusting the degree of importance of the input variables in the classification process; this is known as soft input selection.

Bayesian learning is used to build probabilistic models for the dataset. These models are then used for prediction. Bayesian models are described in terms of a posterior probability density over the weight space. Prediction is then made by integrating over the posterior probability. The evidence framework approach to Bayesian learning for MLP classifiers that they discuss requires a local Gaussian approximation to the posterior probability density. They introduced the concept of input relevance, or ARD, into the evidence framework with the help of the Gaussian assumptions. The main objective of all this is to obtain appropriate values for the weight vector w and the regularization parameter $\alpha$.

They used a Personal Injury Protection (PIP) automobile insurance claim fraud detection dataset for their evaluation. The PIP claims dataset consists of 1399 closed automobile insurance claims from accidents that occurred in Massachusetts, USA in 1993. This data has been investigated for suspicion of fraud by domain experts. The dataset included 25 binary fraud indicators (red flags; refer to Fig 9) and 12 non-indicator inputs (non-flags) that are valuable to investigators in assessing fraudulent claims. In this dataset, ACC is accident, CLT is claimant, INJ is injury, INS is insured driver, etc. The input selection was done after discussions with domain experts.

These closed claims are reviewed by a claim manager for suspected fraud on the basis of these indicators or inputs. Each claim is categorized on a 10 point scale for suspected fraud. Claims are also reviewed on the basis of a verbal assessment by the claim manager. A claim is suspected of fraud if its suspicion score is >= 4, in which case further investigation is done; otherwise no investigation is done.

Figure 9: PIP binary fraud indicators with values (0=No, 1=yes) [Viaene]

In the empirical evaluation they perform input selection using MLP-ARD on the PIP insurance claims data. The input importance ranking obtained from MLP-ARD is then compared with the input importance rankings from logistic regression and decision tree learning. They used the logistic regression approach to classification as a reference for comparison, taking the relative importance of inputs based upon the regression coefficients. They used the decision tree approach to classification as a second reference. Implementation-wise they used the m-estimation smoothed and curtailed C4.5 variant, which is an improved version of the C4.5 algorithm. The importance of an input in the decision tree is determined by its role in splitting the tree so that the maximum entropy difference is achieved. The decision tree implementation did not perform as well as logistic regression and MLP-ARD.

10-fold cross validation can also be performed for input evaluation with the above three classification approaches. This leads to ensemble-based input assessment, in which the input assessment is aggregated and averaged over the 10 models of the cross validation. Fig 10 shows the input rankings derived from the three methods. The rank 1 input is the most important. The number in brackets is the input importance relative to rank 1. From the rankings it is observed that six of the MLP-ARD top ten inputs are the same as in the logistic top ten, and seven are the same as in the C4.5 top ten. From the input rankings it is observed that MLP-ARD and logistic regression give comparable input rankings. All three classifiers can be used at the same time to form an ensemble classifier.
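The ensemble-based input assessment described above (averaging the importance of each input over the cross-validation models, then reporting it relative to the top-ranked input) can be sketched in Python. How each model scores an input is not specified here, so the per-fold scores and the input names below are hypothetical.

```python
def ensemble_ranking(fold_importances):
    """Average per-fold importance scores and rank the inputs.

    fold_importances: one dict per cross-validation model, mapping an
    input name to its importance score in that model.
    Returns (name, importance relative to rank 1) pairs, best first.
    """
    names = list(fold_importances[0])
    n_folds = len(fold_importances)
    mean = {
        name: sum(fold[name] for fold in fold_importances) / n_folds
        for name in names
    }
    top = max(mean.values())
    ranked = sorted(names, key=lambda name: mean[name], reverse=True)
    return [(name, mean[name] / top) for name in ranked]


# Two hypothetical folds (a real run would use ten) and made-up inputs.
folds = [
    {"indicator_a": 4.0, "indicator_b": 2.0, "indicator_c": 1.0},
    {"indicator_a": 6.0, "indicator_b": 3.0, "indicator_c": 1.0},
]
for rank, (name, relative) in enumerate(ensemble_ranking(folds), start=1):
    print(rank, name, round(relative, 2))
```

The relative value printed next to each input plays the role of the bracketed number in Fig 10, with the rank 1 input fixed at 1.0.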
Figure 10: Input Importance Ranking [Viaene]

4. Discussion
I found that the [Lu] paper on rule extraction lacks reasoning for some of its results. They did not explain many results in their analysis. For example, they did not explain why, for function 4, the accuracy with the neural network is lower than with C4.5. They did not explain why the number of conditions per rule for function 5 is smaller for the neural network than for C4.5, while for all other functions the opposite is the case. This paper was written in 1996, when research in this area was at a nascent stage. They also did not explain how exactly they arrived at the pruned network (refer to Fig 2) with only four inputs. Their advocacy for the use of neural networks in classification is justified for scenarios where training time is not a constraint.

While searching for papers on the use of neural networks in the insurance industry, I found that not much research is publicly available. Surely there must be some credible work on the uses of neural networks in insurance companies, but due to competition it is not disclosed.

[Smith] used the SAS Enterprise Miner software for their analysis. While processing the data they used the variable selection node of the SAS tool, but they did not explain how this functionality would work without the tool. In their results they gave classification accuracies for decision thresholds of 0.1 and 0.5. The way they presented the numbers for actually renewed, actually terminated, classified as renewed and classified as terminated policies is not quite clear to me. Their evaluation is not very strong, as they present only the lift chart for their comparison with other classification approaches.

The paper by [Viaene] does not match its title, "Auto claim fraud detection using Bayesian learning neural networks". The researchers talk more about developing the MLP-ARD approach and incorporating it into the evidence framework method than about its use in detecting claim fraud. The focus is more on the theoretical side, with lots of equations. There is very little background information on the various methods used in the paper, which makes the paper difficult to understand. A lot of assumptions and approximations were used to make their method work for soft input selection.
5. Conclusions and Future Work
Using the [Lu] method of extracting rules, high quality rules can be obtained from datasets. Their work acts as a bridge towards using neural networks for classification in data mining. The time required for extracting rules is still large compared to the decision tree approach. As a direction for future work they suggested the use of incremental training and rule extraction from the database. Another way of reducing the training time and increasing the accuracy is to reduce the number of input units of the network.

[Smith] tried to find ways of pricing policies optimally while retaining growth and profitability. Their case study used neural networks to learn and predict customer retention patterns. They discussed some open issues. The first is the identification of misclassification costs for customer retention analysis. The second is the implementation and incorporation of their method in the insurance industry at a larger scale and in real time. They would like to work on these issues in collaboration with the industry.

[Viaene] made a step in the direction of understanding the underlying semantics of the neural network's output predictions. This understanding is important for the use of neural networks in everyday decision making tasks for predicting claim fraud. The impact of input selection on the claim fraud detection process was their main concern. They demonstrated the soft input selection capabilities of their proposed MLP-ARD method on a real-life insurance dataset.

I think neural networks, due to their ability to build complex models, can be used more effectively in insurance and other industries, and there is still scope for a lot of work.
References
1. [Lu] Hongjun Lu, Rudy Setiono and Huan Liu, Effective Data Mining Using Neural Networks, IEEE Transactions on Knowledge and Data Engineering, Vol. 8, 1996, pp. 957-961
2. [MacKay] D. J. C. MacKay, The evidence framework applied to classification networks, Neural Computation, Vol. 4, No. 5, 1992, pp. 720-736
3. [Scuse] David Scuse, Chapter 1 Intro, Class slides, University of Manitoba
4. [Setiono 1995] R. Setiono, A neural network construction algorithm which maximizes the likelihood function, Connection Science, Vol. 7, No. 2, 1995, pp. 147-166
5. [Setiono] R. Setiono, A penalty-function approach for pruning feed forward neural networks, Neural Computation, Vol. 9, No. 1, January 1997, pp. 185-204
6. [Smith] K.A. Smith, R.J. Willis and M. Brooks, An Analysis of Customer Retention and Insurance Claim Patterns Using Data Mining: A Case Study, The Journal of the Operational Research Society, Vol. 51, May 2000, pp. 532-541
7. [Viaene] S. Viaene, G. Dedene and R.A. Derrig, Auto claim fraud detection using Bayesian learning neural networks, Journal of Expert Systems with Applications, Vol. 29, 2005, pp. 653-666
8. [Wiki] Data Mining, http://en.wikipedia.org/wiki/Data_mining