This document summarizes the attributes of a dataset containing information on 2,000 previous loan customers. It analyzes each attribute, noting issues such as duplicates, outliers, and irrelevance. Key attributes that may help predict loan repayment are identified as customer ID, age, current debt, and postcode, while other attributes such as name, gender, and employment status have data quality or relevance issues. The goal is to select the best data mining technique to differentiate customers and predict the likelihood of future loan repayment.
This document discusses predicting credit default risk using a Home Credit dataset. It summarizes an analysis of the data using logistic regression, dealing with imbalanced classes and multicollinearity. Model improvements using SMOTE sampling and excluding correlated variables increased sensitivity from 0.01 to 0.33, though accuracy and AUC decreased. Further improvements could involve trying other models and handling missing values without excluding them.
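The summary above credits SMOTE oversampling with the jump in sensitivity. As an illustration of the core idea only, here is a minimal sketch of SMOTE-style synthetic sampling: each new minority example is an interpolation between a minority point and one of its k nearest minority neighbours. This is not the Home Credit analysis itself, and a real project would use the `imbalanced-learn` implementation; the function name and parameters here are made up.

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority-class samples by interpolating
    between each chosen minority point and one of its k nearest
    minority neighbours (the core idea behind SMOTE)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        # distances from point i to every minority point (including itself)
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because every synthetic point lies on a segment between two real minority points, the oversampled class stays inside the region the minority data already occupies, unlike naive duplication, which only repeats existing points.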
The document discusses developing a default-forecasting model to predict mortgage loan defaults. It reviews previous literature on this topic that has used logistic regression and survival analysis. The proposed model will use logistic regression to predict default probabilities based on loan, economic, and borrower characteristics from a dataset of over 20 million loan entries provided by Freddie Mac. Five separate logistic regression models will be developed for different categories of loans based on their age and delinquency status.
This study examines factors that affect interest rates in peer-to-peer lending by analyzing data from 2,500 loans through Lending Club. A multiple regression model related interest rate to four factors: FICO score, monthly income, debt-to-income ratio, and length of employment. The analysis found that all factors except length of employment had a statistically significant relationship with interest rate, though the effects were small. FICO score had the strongest negative relationship, while income and debt-to-income ratio had a positive relationship. This suggests credit risk still impacts interest rates in peer-to-peer lending, as do income levels and debt burdens to a lesser degree.
The document analyzes variables associated with the interest rate of loans. It uses a dataset of peer-to-peer loans to build a multiple linear regression model relating interest rate to applicant characteristics. These include FICO score, amount requested, amount loaned, debt-to-income ratio, number of open credit lines, outstanding credit balances, credit inquiries, and loan length. The model fits the data well (R-squared = 0.7942) and finds a strong association between interest rate and FICO score in particular. The analysis identifies key variables for determining loan interest rates.
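Both lending summaries above fit multiple linear regressions of interest rate on applicant characteristics and report an R-squared. As a sketch of how such a fit and its R-squared are computed, here is a plain least-squares routine; the simulated FICO and debt-to-income data below are invented for illustration and are not the Lending Club dataset.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept.
    Returns the coefficient vector (intercept first) and the
    R-squared of the fit."""
    A = np.column_stack([np.ones(len(X)), X])     # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = resid @ resid                         # residual sum of squares
    ss_tot = ((y - y.mean()) ** 2).sum()           # total sum of squares
    return beta, 1.0 - ss_res / ss_tot

# Illustrative data: rate falls with FICO, rises with debt-to-income.
rng = np.random.default_rng(0)
fico = rng.uniform(650, 800, 200)
dti = rng.uniform(0, 30, 200)
rate = 25 - 0.02 * fico + 0.05 * dti + rng.normal(0, 0.1, 200)

beta, r2 = fit_ols(np.column_stack([fico, dti]), rate)
```

On data generated this way, the fitted FICO coefficient comes out negative and the debt-to-income coefficient positive, mirroring the direction of the relationships reported in the study.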
The document is a group report for redesigning the Metrolink mobile app. It includes:
- An executive summary outlining flaws in the current app and the goals of redesigning it.
- Details of the methodology used including gathering user requirements, collaborating on ideas, creating navigation diagrams and prototypes.
- Usability evaluations were conducted on the prototypes including SUS surveys, Nielsen's heuristics analysis and heuristic statements.
- The final redesigned app is presented through storyboards and a high-level prototype, addressing user needs through new features like smart journeys, location-based notifications and a rewards system.
Mr. Darran Joseph Mottershead received a First Class Honours degree in BSC(HONS) Computing from Institution ID 11033545 for the 2013/2014 academic period. He earned credits in four level 6 courses: Data Engineering, Human Computer Interaction, Mobile Applications Development, and Project. Mr. Mottershead passed all courses, earning marks ranging from 67 to 96, and received a First Class Honours result.
The document discusses three exercises related to data mining and machine learning algorithms:
1. Drawing a decision tree corresponding to a logical formula in disjunctive normal form about loan eligibility.
2. Calculating the information gain of attributes to determine the best attribute to use for the first branch in a decision tree.
3. Providing an example of when a decision tree would overfit the training data by perfectly modeling unique observations but being unable to generalize to new data.
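Exercise 2 above asks for the information gain of candidate attributes. A minimal sketch of that calculation, entropy of the parent minus the weighted entropy of the partitions an attribute induces, is below; the tiny loan-eligibility table is invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(columns, attr, labels):
    """Parent entropy minus the weighted entropy of the subsets
    produced by splitting on `attr` (columns: dict of attr -> values)."""
    n = len(labels)
    splits = {}
    for value, label in zip(columns[attr], labels):
        splits.setdefault(value, []).append(label)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in splits.values())

# Toy data: "employed" separates the classes perfectly, "owns_home" not at all.
data = {"employed":  ["yes", "yes", "no", "no"],
        "owns_home": ["yes", "no", "yes", "no"]}
labels = ["approve", "approve", "reject", "reject"]
```

Here `information_gain(data, "employed", labels)` is 1.0 bit, while the gain for `owns_home` is 0, so "employed" would be chosen for the first branch.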
This document discusses applying data mining techniques to predict whether a football match will be cancelled due to weather. Attributes that could be used in the prediction include amount of rain, temperature, humidity, and weather conditions. The data would come from weather stations and the football club. A second example discusses using attributes like player numbers, injuries, past goals, and team performance to predict the outcome of a match between Ajax and Real Madrid. The document also explains the difference between a training set, which is used to weight attributes, and a test set, which is used to evaluate the training set's predictions. Key data mining concepts like features, instances, and classes are briefly defined with examples.
This document discusses rule-based classification. It describes how rule-based classification models use if-then rules to classify data. It covers extracting rules from decision trees and directly from training data. Key points include using sequential covering algorithms to iteratively learn rules that each cover positive examples of a class, and measuring rule quality based on both coverage and accuracy to determine the best rules.
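The sequential covering idea described above can be sketched as a greedy search for one rule at a time. The toy learner below scores every single-condition rule by accuracy on the target class, breaking ties by coverage, which is one simple version of the coverage-and-accuracy quality measure mentioned; the attribute names and data are invented for illustration.

```python
def learn_one_rule(columns, labels, target):
    """Return the (attribute, value) test that best covers `target`
    examples: highest accuracy among covered rows, ties broken by
    coverage. columns: dict of attribute -> list of values."""
    best_key, best_rule = None, None
    for attr, col in columns.items():
        for v in set(col):
            covered = [lab for x, lab in zip(col, labels) if x == v]
            acc = covered.count(target) / len(covered)
            key = (acc, len(covered))          # accuracy first, then coverage
            if best_key is None or key > best_key:
                best_key, best_rule = key, (attr, v)
    return best_rule

data = {"employed":  ["yes", "yes", "no", "no"],
        "owns_home": ["yes", "no", "yes", "no"]}
labels = ["approve", "approve", "reject", "reject"]
```

A full sequential covering algorithm would repeat this step, removing the positive examples each learned rule covers, until no positives remain.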
This document describes an example of classifying medications for allergic rhinitis using the Weka software. Patient variables such as age, sex, and cholesterol and sodium levels are analyzed to predict the best drug among DrogaA, DrogaB, DrogaC, DrogaX, and DrogaY. The J48 method achieved an accuracy of 92.5%, showing that DrogaY is the most effective drug, especially across patients of different ages.
This document provides an overview of decision trees, including:
- Decision trees classify records by sorting them down the tree from root to leaf node, where each leaf represents a classification outcome.
- Trees are constructed top-down by selecting the most informative attribute to split on at each node, usually based on information gain.
- Trees can handle both numerical and categorical data and produce classification rules from paths in the tree.
- Examples of decision tree algorithms like ID3 that use information gain to select the best splitting attribute are described. The concepts of entropy and information gain are defined for selecting splits.
Classification of commercial and personal profiles on MySpace
This document presents research on classifying profiles on MySpace as either commercial or personal. A decision tree classifier called J48 was developed that uses profile attributes such as gender, age, friends, and publishing relationships. The J48 classifier achieved an accuracy of 92.25% to 96.42% on test data. The classifier is then applied to a privacy-preserving system where personal profiles can publish anonymously through an "avatar" to avoid disclosing private information to commercial profiles.
Talk to Student Information Point Colleagues about Moodle (Mark Stubbs)
This slide deck was written to support a workshop with Student Information Point colleagues about process improvement. It sets out plans for MMU's new Moodle VLE, highlighting the importance of tagging content with curriculum codes so that material from different systems can be aggregated into Moodle.
This document discusses Manchester Metropolitan University's efforts to improve the student experience through a coordinated initiative called EQAL. EQAL aims to transform the curriculum, administration processes, learning technologies, and quality assurance processes. As part of this initiative, MMU is developing a new integrated virtual learning environment called Talis Aspire that will provide students with a seamless, personalized online experience across devices including mobile access to resources like timetables, reading lists, and submitting assignments. The university sees mobile and responsive design as important to wrapping resources and services around the learner.
This document summarizes a presentation about using student data to improve the student experience at a large university undergoing transformation. It discusses problems students and staff faced regarding standardized curriculum, assessments, timetabling, and quality processes. A coordinated EQAL program was established to address these issues through new curriculum, systems, learning spaces, and quality processes. Data from internal student surveys and other sources was analyzed to better understand student satisfaction predictors and feedback. Key lessons focused on hands-on management, communication, an inclusive approach, addressing root problems, and working with staff to jointly develop improvement plans using data.
This document discusses Newcastle University's approach to module evaluation as a tool to improve teaching quality. It provides an overview of the university and the context for module evaluations. It describes the process of implementing online module evaluations across the university using a centralized system with common questions. It discusses challenges engaging both staff and students, and strategies used. It concludes by reflecting on lessons learned and plans for continued improvements.
The document discusses a student project investigating the phenomenon of "banner-blindness" in online advertising. The project aims to study how different elements of advertisements contribute to their effectiveness or lack of it. The student conducted an experiment using eye-tracking software to observe participants viewing websites and search engine results pages with varying advertisement criteria. Questionnaires were used afterwards to measure participants' recall and recognition of the advertisements. The results found that advertisement location, proximity to other elements, and inclusion of certain elements affected how well advertisements were remembered. While no complete solution was found, factors like relevance and the inclusion of specific elements seemed to improve advertisement effectiveness.
This chapter discusses selection statements in C++. It covers relational expressions, if-else statements, nested if statements, the switch statement, and common programming errors. Relational expressions are used to compare operands and evaluate to true or false. If-else statements select between two statements based on a condition. Nested if statements allow if statements within other if statements. The switch statement compares a value to multiple cases and executes the matching case's statements. Programming errors can occur from incorrect operators, unpaired braces, and untested conditions.
EXPLORING DATA MINING TECHNIQUES AND ITS APPLICATIONS
This document discusses various data mining techniques. It begins with an introduction to data mining, explaining that it is used to discover patterns in large datasets. It then describes five major techniques: association, which finds relationships between items purchased together; classification, which assigns items to predefined categories; clustering, which automatically groups similar objects; prediction, which discovers relationships to predict future outcomes; and sequential patterns, which finds patterns over time. The document concludes by discussing some applications of data mining such as customer profiling, website analysis, and fraud detection.
The document discusses conditional statements in C# including if, if-else, nested if statements, and switch-case statements. It covers:
- Comparison and logical operators that are used to compose logical conditions for conditional statements
- How the if and if-else statements provide conditional execution of code blocks based on evaluating conditions
- Nested if statements allow creating more complex logic by placing if statements inside other if or else blocks
- The switch-case statement selects code for execution depending on the value of an expression, making it useful for multiple comparisons
The document discusses various concepts in data mining and decision trees including:
1) Pruning trees to address overfitting and improve generalization,
2) Separating data into training, development and test sets to evaluate model performance,
3) Information gain favoring attributes with many values by having less entropy,
4) Strategies for dealing with missing attribute values such as predicting values or focusing on other attributes/classes,
5) Changing stopping conditions for regression trees to use standard deviation thresholds rather than discrete classes.
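Item 5 above mentions switching regression-tree stopping conditions to standard-deviation thresholds. A minimal sketch of the underlying quantity, how much the weighted standard deviation of the target drops under a candidate split, is below; the numbers are made up, and growth would stop once this reduction (or the child standard deviation itself) falls below a chosen threshold.

```python
import statistics

def sd_reduction(parent, splits):
    """Standard-deviation reduction of a candidate regression-tree split:
    SD of the parent's target values minus the size-weighted SD of the
    children's target values."""
    n = len(parent)
    weighted = sum(len(s) / n * statistics.pstdev(s) for s in splits)
    return statistics.pstdev(parent) - weighted

# Toy target values: a split separating the two clusters reduces SD a lot.
parent = [1, 2, 10, 11]
splits = [[1, 2], [10, 11]]
```

For this toy data the split cuts the standard deviation from about 4.53 to 0.5, a large reduction, so a threshold-based stopping rule would keep growing here.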
Tree pruning removes parts of a decision tree that overfit the training data due to noise or outliers, making the tree smaller and less complex. There are two pruning strategies: postpruning removes subtrees after full tree growth, while prepruning stops growing branches when information becomes unreliable. Decision tree algorithms are efficient for small datasets but have performance issues for very large real-world datasets that may not fit in memory.
The document discusses perceptrons and gradient descent algorithms for training perceptrons on classification tasks. It contains 4 exercises:
1) Explains the role of the learning rate in perceptron training and which Boolean functions can/cannot be modeled with perceptrons.
2) Applies a perceptron to a sample dataset, calculates outputs, and determines the accuracy.
3) Performs one iteration of gradient descent on the same dataset, computing weight updates with a learning rate of 0.2.
4) Performs one iteration of stochastic gradient descent on the dataset, recomputing outputs and updating weights after each instance.
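The stochastic variant in exercise 4 can be sketched as the delta rule on an unthresholded linear unit: weights are updated after every instance rather than once per pass. The learning rate 0.2 matches the exercise; the dataset below is invented, not the one from the document.

```python
import numpy as np

def sgd_pass(w, X, y, lr=0.2):
    """One pass of stochastic gradient descent with squared error on a
    linear unit: the delta rule, applied instance by instance."""
    for x, t in zip(X, y):
        x = np.append(1.0, x)            # bias input fixed at 1
        o = w @ x                        # unthresholded linear output
        w = w + lr * (t - o) * x         # update immediately after this instance
    return w

# Toy data: targets are the Boolean AND of the two inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)
w = np.zeros(3)
for _ in range(20):
    w = sgd_pass(w, X, y)
```

Unlike batch gradient descent (exercise 3), which accumulates one update over the whole dataset, each instance here nudges the weights immediately, so the outputs used later in the pass already reflect earlier updates.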
The document describes the Knowledge Flow Interface in Weka, which allows users to visually connect components like data loaders, classifiers, and evaluators to build machine learning workflows. As a demonstration, it walks through building a workflow that loads ARFF data, performs attribute assignment, 10-fold cross-validation with J48 decision trees, and evaluates the classifier performance. Additional components like a graph viewer are included to visualize the decision trees from each fold.
Moodle MOOC 9: Setting up a Gradebook for Assignments (Ramesh C. Sharma)
This presentation covers setting up the Moodle Gradebook for assignments: assessment is an important activity, and Moodle can easily be configured for assignment evaluation and grade management.
The document describes speech channel assignment and channel mode modification procedures in 3G mobile networks. It discusses (1) how the BSC assigns TCH channels to an MS based on a service request, (2) the internal BSC signaling for channel assignment, and (3) how the BSC modifies the channel mode based on an assignment request from the MSC.
This document discusses challenges with collecting accurate and complete data in Homeless Management Information Systems (HMIS). It notes that data quality is influenced by many personalities, including clients, end users entering data, and the variables themselves. Specific issues are discussed for variables like name, social security number, date of birth, race, gender, disability status, income, and housing status. Accuracy depends on properly training staff on interview techniques, understanding skip patterns and response options, and updating records over time. The document emphasizes that while data is used for important purposes, the limitations of self-reported data from a hard-to-reach population must also be acknowledged.
DAT 520 Milestone Three Guidelines and Rubric

In this milestone, you will evaluate your decision model and revise it as needed. For example, if you are performing a bottom-up recursive partitioning analysis, you should report on the error rate and variable selection, and you might also consider alternative variable categorizations to improve your model. If you are performing a top-down decision tree modeling exercise, what are the threshold values that cause the tree to flip? You should perform sensitivity analysis on the critical variables in your tree and report what those sensitivity analyses are telling you. For either style of modeling, what makes your tree stronger? What breaks the model? For more information on completing this milestone, please refer to the Final Project Notes in the Assignment Guidelines and Rubrics folder.
Specifically, the following critical elements must be addressed in your final submission:
Include the structure of your revised decision tree, with a clear description.
Evaluate the results of your revised model, including analysis that is specific to your revised model. In your evaluation, reflect on the appropriateness
and adjustments of the revised model, as well as the accuracy of the results you obtained.
Suitable diagnostics should be incorporated into the model.
Guidelines for Submission: This milestone should be 2 to 3 double-spaced pages of text, with tree model images and any other supporting material appended.
Review your work to ensure that there are no major errors in writing mechanics. If you have citations, include the sources at the end and cite them in APA format.
| Critical Elements | Proficient (100%) | Needs Improvement (70%) | Not Evident (0%) | Value |
| --- | --- | --- | --- | --- |
| Structure | Decision tree and description are clearly structured | Decision tree and description are somewhat clearly structured | Decision tree and description are not adequately structured | 30 |
| Evaluation of Results | Evaluation considers reasonableness, accuracy, missing/extraneous elements, and error in the model | Evaluation does not fully consider reasonableness, accuracy, missing/extraneous elements, and error in the model | Evaluation does not consider reasonableness, accuracy, missing/extraneous elements, and error in the model | 30 |
| Model Diagnostics | Model includes clear use of diagnostics | Model builds in partial use of diagnostics | Model does not include diagnostics | 30 |
| Articulation of Response | Submission has no major errors related to grammar, spelling, syntax, or organization | Submission has major errors related to grammar, spelling, syntax, or organization that negatively impact readability and articulation of main ideas | Submission has critical errors related to grammar, spell… | |
DAT 520 Milestone Three Guidelines and Rubric
In this milestone, you will perform an evaluation of your decision model and revise your decision model as needed. Evaluation examples are if you are
performing a bottom-up style recursive partitioning analysis, and you should report on the error rate and variable selection. You might also consider alternative
variable categorizations to improve your model. If you are performing a top -down decision tree modeling exercise, what are the threshold values that cause the
tree to flip? You should perform sensitivi ty analysis on the critical variables in your tree and report what those sensitivity analyses are telling you. For either sty le
of modeling, what makes your tree stronger? What breaks the model? For more information on completing this milestone, please ref er to the Final Project
Notes in the Assignment Guidelines and Rubrics folder.
Specifically, the following critical elements must be addressed in your final submission:
Include the structure of your revised decision tree, with a clear description.
Evaluate the results of your revised model, including analysis that is specific to your revised model. In your evaluation, reflect on the appropriateness
and adjustments of the revised model, as well as the accuracy of the results you obtained.
Suitable diagnostics should be incorporated into the model.
Guidelines for Submission: This milestone should be 2 to 3 double-spaced pages of text, with tree model images and any other supporting material appended.
Review your work to ensure that there are no major errors in writing mechanics. If you have citations, include the sources at the end and cite them APA format.
Critical Elements Proficient (100%) Needs Improvement (70%) Not Evident (0%) Value
Structure Deci s i on tree and des cri ption are cl early
s tructured
Deci s i on tree and des cri ption are
s omewhat cl early s tructured
Deci s i on tree and des cri pti on are not
adequatel y s tructured
30
Evaluation of Results Eval uati on cons i ders reas onablenes s ,
accuracy, mi s s ing/extraneous el ements ,
and error i n the model
Eval uati on does not ful l y cons i der
reas onabl enes s , accuracy,
mi s s i ng/extraneous el ements , and error
i n the model
Eval uati on does not cons i der
reas onabl enes s , accuracy,
mi s s i ng/extraneous el ements , and error
i n the model
30
Model Diagnostics Model i ncl udes cl ear us e of di agnos ti cs Model bui l ds i n parti al us e of
di agnos ti cs
Model does not i ncl ude di agnos ti cs 30
Articulation of
Response
Submi s s i on has no major errors rel ated
to grammar, s pel l i ng, s yntax, or
organi zati on
Submi s s i on has major errors rel ated to
grammar, s pel l i ng, s yntax, or
organi zati on that negati vel y i mpact
readabi l ity and arti culation of mai n
i deas
Submi s s i on has criti cal errors rel ated to
grammar, s pel l.
A Holistic Approach to Property ValuationsCognizant
A brief guide to incorporating all available data, structured and unstructured, in the property valuation activity, by choosing key variables and running textual analysis.
The document discusses better underwriting practices in auto financing. It covers the state of consumer credit data and opportunities in non-prime lending. Key components of good underwriting include assessing a buyer's capacity to pay, collateral value, and credit data. Alternative credit data from sources beyond the major credit bureaus can help qualify more consumers, reduce losses, and improve conversion rates. Such data provides insights into applicants' identities, addresses, employment histories, and stability that can benefit underwriting decisions. Questions about using alternative data and protecting customers are addressed.
The document discusses entity relationship (ER) modeling concepts including:
- Entities, attributes, and relationships can be represented graphically in ER diagrams
- Relationships have cardinalities like one-to-one, one-to-many, many-to-many that specify how entities are associated
- Weak entities depend on other entities and cannot be uniquely identified without attributes from the associated entity
The document discusses entity relationship (ER) modeling concepts including:
- Entities, attributes, and relationships can be represented graphically or as tables
- Key types include primary keys and composite keys
- Relationship cardinalities include one-to-one, one-to-many, many-to-many
- Weak entities cannot be uniquely identified without attributes from a related strong entity
https://ijitce.com/index.php
Our journal maintains rigorous peer review standards. Each submitted article undergoes a thorough evaluation by experts in the respective field. This stringent review process helps ensure that only high-quality and scientifically sound research is accepted for publication. Researchers can trust that the articles they find in IJITCE have been critically assessed for validity, significance, and originality.
Interpretation of Data and Statistical FallaciesRHIMRJ Journal
This document discusses common factors that can lead to misinterpretation of statistical data and statistical fallacies. Some key factors discussed include bias, inconsistent definitions, false generalization from incomplete data, improper comparisons, and wrong interpretation of statistical measures like averages, dispersion, and correlation. Technical errors in data collection, analysis, and reporting can also result in incorrect conclusions. Overall, the document emphasizes the importance of careful interpretation of statistical data and avoiding common pitfalls to ensure valid and meaningful conclusions are drawn.
The document analyzes data from a credit case study to identify patterns relating to loan repayment. It finds that loan purposes for repairs and an income type of working have higher rates of unsuccessful payments. Housing in a co-op apartment is also associated with higher difficulty in repayment. The analysis recommends banks focus on contract types like student and pensioner, as well as housing types other than co-op apartments, to increase successful payments. It also suggests caution when providing loans for repairs while emphasizing focusing on housing with parents, houses/apartments, and municipal apartments.
This project aims to predict whether a loan application will be approved or denied based on various factors such as applicant's income, credit score, loan amount, etc. Using a dataset containing historical loan application data, we employed machine learning algorithms to build a predictive model. The model was trained on features such as applicant's income, credit history, loan amount, loan term, and others. After training the model, we evaluated its performance using metrics like accuracy, precision, recall, and F1 score. The insights from this project can help financial institutions streamline their loan approval process and make informed decisions. Visit for more information: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
DataBase Management System Relationship
Posted on January 4, 2024
Introduction
The PAW Fondation links animal right and well-being campagnes worldwide. Many Works for its abord branches. PAW pourchasse a massive data base to simplifie Operations. The Project data analyst wants to build and Install a SQL data base for the organisation. Branch, worker, member, suscription, payement, contribution, and other donation data Will Be retained. PAW’s unusual organisationnel structure—one Branch per zip code—uses a well-built data base to engage with other animal protection NGOs. The system Will process monetary donations, other presents, and complex Relationship between Works, managers, subscribers, and other contributors.
The data base now includes “Volontiers” and “Events,” Along with the Project scenarios main components, to improve It. Without compassionate chapter volontiers, PAW would Fail. Event inclusion promotes community-building via Schedule events. This Project Will meet PAW’s data management needs and prepare for analytics.
Following sections explain the logical data model, Entity-Relationship Diagram (ERD), and SQL data base design. Each data base design process evaluated accuracy, efficience, and Protect Animal Welfare’s worldwide animal welfare activism.
Part A:
As I Am now working on a series of practice problems for ERD, I was wondering what the best strategy is for modelling Ethier or Relationship. Could You perhaps provider me with some information? At This very moment, I Am working on the collection of questions. At the moment, I Am working on tasks That are considered to Be practice problems.
For exemple, You Will Be responsable for maintaining Customer accounts at a Taekwondo school. These accounts Will Be in charge of representing and paying for one or more pupils. Using these accounts to make payements is Something That is going to Be done in the future. The accounts in question are ones That the organisation has the potentiel to acquire in the future. This issue Will Be decided by the conditions That are now in existence; nonetheless, There is a chance That the account is owned by Ethier the student or a parent. Nevertheless, This matter Will Be resolved. Depending on the circumstances, the student of the parent is the one whois is the owner of the account. This is because the student is the owner of the account, which is the reason for this difference.
Relationship:
Sure, let’s identify the types of Relationship for each pair of tables:
1. One-to-One Relationship:
– Suscriptions and Payements: Each suscription has one correspondant payement, and Eich payement is relate to one suscription.
2. One-to-Many Relationship:
– Branches to Employees: One Branch Can have man employées, but an employée bélongs to onlay one Branch.
– Branches to Volontiers: One Branch Can have man volontiers, but a volontiers bélongs to onlay one Branch.
– Branches to Events: One Branch Can organise man events, but an évent is Associate
This document contains the final assessment for a Quantitative Methods and Data Analysis module. It includes 5 tasks:
Task 1 defines variables and their properties in an SPSS data view.
Task 2 analyzes consumer characteristics at a retail center using cross-tabulation and bar charts, finding most consumers are married and employed.
Task 3 examines consumer spending based on characteristics, finding outliers in some groups and differences between occupations and appreciation levels.
Task 4 analyzes the distribution of monthly spending and distance traveled, finding neither are normally distributed using various tests.
Task 5 proposes using a t-test to examine differences in age and gender, stating the null and alternative hypotheses.
A nurse named Florence Nightingale used data visualization to reduce mortality rates during the Crimean War. She created a pie chart showing the causes of soldier deaths, with the largest slice representing deaths from avoidable hospital diseases. This helped Nightingale convince officials to improve sanitation in hospitals, which reduced the death rate from 40% to 2%. The document then discusses how data storytelling can help individuals advance their careers and provides tips on summarizing data insights concisely for different audiences.
Pay scale presentation 7 questions to ask about salary data sourcesPayScale, Inc.
The document discusses 7 key questions to ask when evaluating salary surveys. It compares traditional annual salary surveys conducted by consultants to the crowdsourced data available from PayScale. PayScale's data is continuously collected and processed, covering unlimited locations and combinations of skills. It provides transparent details on its 250+ compensable factors and data sources to help users determine the best fit for their unique compensation needs.
This document discusses four research studies related to sexual abuse. The first study examines reasons why sexual abuse is rarely reported, finding that fear, humiliation, and doubts about the legal system are common reasons. The second study looks at prevalence of sexual abuse among youth, especially LGBTQ individuals. The third evaluates the psychosocial impacts of sexual abuse on adolescents. The fourth analyzes common feelings after abuse like powerlessness and stigmatization. Together these studies explore topics like reporting rates, prevalence, impacts, and experiences of sexual abuse.
The document discusses using big data to improve analytics for financial services applications. It provides examples of using additional unstructured data sources like news feeds, research reports, social media etc. to enhance models for fraud detection, customer attrition analysis, and estimating default correlation and pricing securitized bonds. Currently these models only use structured transactional data but incorporating additional context from big data sources could provide more accurate insights and predictions.
This document discusses using big data and analytics within Oracle's Financial Services Analytical Applications (OFSAA) platform. It provides examples of using additional unstructured data sources like news feeds, research reports, and social media to improve analytics for detecting fraud, predicting customer attrition, and more accurately estimating default correlation and pricing securitized bonds. The document outlines how OFSAA currently uses only structured data but could be enhanced by incorporating big data sources to power more predictive models.
This infographic uses a stacked bar graph to illustrate stalking behaviors experienced by male and female victims from male and female stalkers. It finds that unwanted contact through communication methods is the most common behavior experienced. The purpose is to spread awareness of stalking and help victims seek support. It will be distributed through public places in Brisbane, Australia to reach a wide audience.
This document describes building models to predict credit card default payments. It retrieves credit card data from a public dataset containing details on customers' personal information, credit limits, payment histories and default statuses. The data is explored through visualizations to identify relationships between variables. Two classification models are built using KNN and decision tree algorithms. The decision tree model achieves a higher accuracy of 80% compared to KNN's 74% accuracy, indicating decision trees are more suitable for predicting default payments from this credit card data.
Poor data quality costs businesses significantly through wasted labor, lost productivity and direct financial losses. Rapid changes in circumstances like addresses, phone numbers, jobs and relationships mean that master data changes at a rate of around 2% per month. Siebel's data management solution and new data quality products from Oracle help address these issues through comprehensive data profiling, cleansing, matching and enrichment capabilities. Regular data stewardship is important for ongoing monitoring and governance to maintain high quality master data.
Similar to Data Engineering - Data Mining Assignment (20)
Data Engineering (6G6Z1006) - 1CWK50 - 11033545 Darran Mottershead

Data Mining Assignment

Contents:
1.0 Introduction
2.0 Data Summary
2.1 Customer ID
2.2 Forename/Surname
2.3 Age
2.4 Gender
2.5 Years at Address
2.6 Employment Status
2.7 Country
2.8 Current Debt
2.9 Postcode
2.10 Income
2.11 Own Home
2.12 CCJs
2.13 Loan Amount
2.14 Outcome
3.0 Data Mining Technique Selection
3.1 Decision Trees
3.2 Naïve Bayes
4.0 Data Exploration and Preparation
4.1 Relevant Data
4.2 Noise
4.3 Outliers
5.0 Modelling
Experiment One
Parameters
Experiment Two (J48)
Experiment Three (J48)
Experiment Four (Naive Bayes)
Experiment Five (Naive Bayes)
6.0 Conclusion
6.1 Final Results
7.0 References
1.0 Introduction

The assignment brief states that 'Banks are currently having trouble with debt', and that the banks would like to avoid lending money in the future to people who are unable to repay these loans. Using the data provided from 2000 previous loan customers, appropriate data mining techniques allow differentiation between customers, and prediction of how likely a customer is to repay their loans. Predictions of the likelihood of a customer repaying future loans are determined from the attributes and outcomes recorded in the dataset for each individual.

The raw data contained 15 attributes for 2000 individuals (instances); a unique six-digit ID identified each customer.
Attributes:
Customer ID | Forename | Surname | Age | Gender | Years at Address | Employment Status | Country | Current Debt | Postcode | Income | Own Home | CCJs | Loan Amount | Outcome

Terminology:

Task / Scenario / Project all refer to the situation for which this report and data mining project have been produced: the scenario of the financial establishment, and the end goal of predicting the probable chance of future loans being repaid by customers.

Financial Establishment refers to the bank from which the data was supplied.

Record(s) refer to the individual row of data values for a particular customer or customers within the provided dataset.

Red text within tables (Data Summary section) indicates flagged issues, which may affect the data, or which should be attended to.

Instances / Customers / Individuals / Person / Applicant all refer to an individual set of values for one unique person. One person has 15 attributes, and can be identified by their Customer ID.

NB, found in chapter 4 (table data), is shorthand for Naive Bayes.

In the Modelling section: D = Defaulted | P = Paid

Mean refers to the average of a set of numbers.

Standard Deviation refers to the quantity expressing by how much the members of a group differ from the mean value for the group.

Noise: a value that differs from the norm; it is error or irrelevant data, not genuinely part of the data, and should be removed.

Outliers: a value that significantly differs from the norm; it is not usually expected, however it is part of the data and may be of importance.
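As a concrete illustration of the Mean and Standard Deviation terms defined above, a minimal sketch using Python's standard library; the sample of ages is invented purely for illustration.

```python
import statistics

# Invented sample of ages, only to illustrate the two terms.
sample = [17, 23, 44, 53, 89]

mean = statistics.mean(sample)      # the average of the values
spread = statistics.pstdev(sample)  # population standard deviation about the mean

print(mean)              # 45.2
print(round(spread, 3))
```

`pstdev` treats the sample as the whole population, matching the descriptive use of standard deviation in this report; `stdev` would give the sample estimate instead.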
2.0 Data Summary

2.1 Customer ID
Numeric, Unique

Inspection of the dataset found that the Customer ID attribute, a unique numeric value used by the financial establishment to independently identify a customer, had a number of duplicate values. The dataset had a 99% unique rate (1986 unique values) for the Customer ID attribute, leaving a remainder of 14 values that were either entered incorrectly into the system, or that should be merged because the individual customer may have requested multiple loans.

It could be deduced from the unique rate for the Customer ID that 7 customers each use the same Customer ID twice, whether in error or due to multiple loan requests. However, this is not the case: filtering duplicate Customer IDs from the dataset displayed different customers referencing the same 'unique' Customer ID.

It is important to ensure each customer can be uniquely identified, as the merging of many customer accounts may affect the data, especially if incorrect Customer ID entry is a regular occurrence. This also limits how effective the other attributes are in finding a solution to the proposed loan repayment issue, as their link to the correct customer, and therefore to other relevant attributes, is uncertain.
Customer ID (Attribute) | Missing | Distinct | Unique
Value | 0 | 1993 | 1986
Percentage | 0% | 99% | 99%

Customer ID | Forename | Surname | Age
684733 | Paul | Brown | 44
684733 | Mark | Bickertor | 27
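The duplicate check described above can be sketched as follows: group rows by Customer ID and flag any ID shared by rows whose names differ, which distinguishes genuine repeat applicants from distinct customers wrongly sharing one identifier. The records and field names here are illustrative, not the real dataset.

```python
from collections import defaultdict

# Illustrative records only; the real dataset has 2000 rows.
records = [
    {"customer_id": 684733, "forename": "Paul", "surname": "Brown"},
    {"customer_id": 684733, "forename": "Mark", "surname": "Bickertor"},
    {"customer_id": 730480, "forename": "Andrew", "surname": "D Evil"},
]

# Group the (forename, surname) pairs under each Customer ID.
by_id = defaultdict(list)
for row in records:
    by_id[row["customer_id"]].append((row["forename"], row["surname"]))

# An ID mapping to more than one distinct name is a conflict, not a repeat loan.
conflicts = {cid: names for cid, names in by_id.items() if len(set(names)) > 1}
print(conflicts)  # {684733: [('Paul', 'Brown'), ('Mark', 'Bickertor')]}
```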
2.2 Forename/Surname
Nominal

The forename and surname of customers, stored in the financial establishment's database and therefore in this dataset, are not attributes that can provide any statistical information of value, and have very little relevance to estimating the probability of customers repaying future loans.

This does not exempt the attributes from containing multiple variations of data, and some invalid entry types. Due to the nature and irrelevance of naming and gender, 'R Bundy' can be considered a valid entry, however it makes for 'messy' data.

Punctuation can be found within some Surname values, such as De,ath and O,sullivan (ideally O'Sullivan). It would not be a serious loss to remove these entries, given the size of the dataset. The more worrying element is the mismatched gender attribute, as Peter Fewson has been entered as female. It is important to reiterate how little the forename and surname attributes contribute to the overall outcome of predicting future loan repayments, as naming and gender provide no real usable statistics.
2.3 Age
Numeric

Age appears to be an attribute that has been entered correctly for every customer within the dataset; age values range from 17 to 89, with a total of 73 distinct values. Statistical data provides a greater understanding of the attribute, displaying a Mean of 52.912 (average age 53, rounded). The Standard Deviation, measuring the spread of data around the Mean, is 20.991, presenting a good spread of age values rather than a less reliable clustered spread.

Probability distributions can be applied to specific subsets of age values, e.g. 17 to 20 (youth), or other subsets deemed fit to provide the most reliable data, e.g. bands of 5 or 10 year age differences.

Age subsets could provide statistical data on the reliability of these specific age groups, without any external factors affecting the customer's situation. By combining this data with external factors, e.g. debt amount, valuable resulting data could be obtained to present the probability of these customers repaying future loans.
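The age-subset idea above can be sketched as a simple banding step; the sample ages and the 10-year band width are illustrative assumptions.

```python
from collections import Counter

# Illustrative ages only; the real dataset holds 2000 values from 17 to 89.
ages = [17, 19, 23, 44, 53, 61, 89]

def age_band(age, width=10):
    # Map an age to a band label such as "10-19" or "50-59".
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

# Count customers per band; outcome rates per band could then be compared.
bands = Counter(age_band(a) for a in ages)
print(dict(bands))  # {'10-19': 2, '20-29': 1, '40-49': 1, '50-59': 1, '60-69': 1, '80-89': 1}
```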
Customer ID | Forename | Surname | Gender
730480 | Andrew | D Evil | F
1040020 | David | @ | M
993790 | Ian | O,sullivan | F
979431 | Michael | De,ath | M
986670 | R | Bundy | M
1037772 | Peter | Fewson | F
2.4 Gender
Nominal

Based on the majority of values for the Gender attribute in the provided dataset, the expected entries for gender are 'M' and 'F'.

Post-analysis, the gender attribute displays a total of nine discrete variations which represent male and female. The majority of values are presented as 'M' and 'F' for their respective genders. The significance of the attribute values 'H', 'D' and 'N' is unknown; however, using the forename attribute, the most appropriate gender will be associated, and the gender attribute changed accordingly. These values have therefore been classified as noise, as the data has been entered incorrectly.

The remaining two variations, 'Male' and 'Female', could simply be altered to 'M' and 'F'; however, under further inspection, particularly of 'Female', the forenames included David, Simon, Stuart and Dan. Gender data would be entered via a checkbox / radio button (easily mis-selected), whereas the forename attribute would be entered manually, so it is more probable that the gender attribute is incorrect. Therefore, the four values labelled 'Female' have been altered to 'M' to represent male; the gender data now has two distinct values, 'F' and 'M', as the other variations have been eliminated.

The percentage split between M and F values is extremely similar for this large dataset (3.1% differentiation). Considering the low percentage bias, the gender attribute could be used to provide statistics on the probability of repayment of future loans.

However, gender discrimination may be a concern, rendering the gender of customers irrelevant. As previously mentioned, all data for the gender attribute may be kept, even if a presumed gender for a customer cannot be entirely accurate. The role gender plays in the probability of customers repaying future loans can be determined (near-even split of data); however, its usefulness is low and it may even be disregarded entirely.

Gender is unreliable based on the Forename attribute shown in section 2.2, as Ian and Andrew, forenames generally presumed to be male, are both recorded as female. As described above, the usefulness of the gender attribute is minimal, therefore it is not worthwhile to clean the data.
Gender (Attribute, after cleaning) | F | M
Count | 969 | 1031
Percentage | 48.45% | 51.55%

Gender (Attribute, as provided) | F | M | Male | Female | H | D | N | 1.0 | 0.0
Count | 968 | 1017 | 3 | 4 | 2 | 1 | 2 | 2 | 1
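The gender clean-up described above can be sketched as a normalisation function: map the raw variants to 'M'/'F', treating 'Female' entries with male forenames as entry errors and falling back to the forename for unknown codes. The forename lookup set is a hypothetical stand-in for whatever name reference the cleaner would actually use.

```python
# Canonical spellings observed in the dataset.
CANONICAL = {"M": "M", "F": "F", "Male": "M", "Female": "F"}

# Assumption: a small illustrative lookup of likely-male forenames.
LIKELY_MALE_FORENAMES = {"David", "Simon", "Stuart", "Dan", "Ian", "Andrew"}

def clean_gender(raw, forename):
    if raw in CANONICAL:
        # 'Female' entries carrying male forenames are treated as mis-entered.
        if raw == "Female" and forename in LIKELY_MALE_FORENAMES:
            return "M"
        return CANONICAL[raw]
    # Unknown codes ('H', 'D', 'N', '1.0', '0.0') fall back to the forename.
    return "M" if forename in LIKELY_MALE_FORENAMES else "F"

print(clean_gender("Female", "David"))  # M
print(clean_gender("F", "Sarah"))       # F
```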
2.5 Years at Address
Numeric

The majority of values for the Years at Address attribute have been entered correctly; however, there are four anomalies indicating that customers have been living at the stated address for an unrealistic period of time.

Given the attribute values shown, it is not conclusive whether these four values were entered incorrectly (an additional digit), or whether the figure is intended to represent a number of months rather than years.

The four values shown below have been interpreted as noise, as they lie considerably outside the normal range for the Years at Address attribute and are presumed to be invalid data.

The four customers listed could be removed from the dataset, as the remaining 1996 customers would still provide reliable data (the loss of four customers' data is minimal). Alternatively, the average Years at Address value could be taken from the 1996 valid values (excluding the noisy values) to replace the four anomalies.

However, Years at Address could be regarded as an attribute that does not affect the end outcome for the probability of a customer repaying their loan, and therefore the values could remain within the dataset. It is unwise to assume what the provided figures are intended to be, and they should not be changed.

The dataset in its original state has 74 distinct values ranging from a minimum of 1 year to a maximum of 560 years; presuming the above scenario of incorrectly input data, the maximum would be 71 years. The following results are for the unedited dataset as provided.

Subsets of 10 year intervals for this attribute could assist in providing a reliable probability rate of repaying future loans, dependent on years lived at the current address.
Customer ID | Forename | Surname | Years at Address
1075394 | Simon | Wallace | 560
967482 | Brian | Humphreys | 410
678376 | Steve | Hughes | 300
722046 | Chris | Greenbank | 250

Years at Address | Unedited Dataset | Edited Dataset
Minimum Value | 1 | 1
Maximum Value | 560 | 71
Number of Values | 1996 | 2000
Distinct Values | 74 | 70
Mean | 18.526 | 17.801
Standard Deviation | 23.202 | 15.764
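One of the two repair options discussed above, replacing the noisy values with the mean of the remaining values, can be sketched as follows; the sample values and the 100-year cut-off are assumptions for illustration only.

```python
# Illustrative values; 560 and 410 stand in for the noisy entries.
values = [3, 12, 25, 7, 560, 410]
THRESHOLD = 100  # assumption: no plausible stay at one address reaches 100 years

# Mean of the valid values only, excluding the noise.
clean = [v for v in values if v < THRESHOLD]
mean = sum(clean) / len(clean)

# Replace each noisy value with the rounded mean of the valid values.
repaired = [v if v < THRESHOLD else round(mean) for v in values]
print(repaired)  # [3, 12, 25, 7, 12, 12]
```

The alternative option, simply dropping the four rows, would be the `clean` list on its own.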
2.6 Employment Status
Nominal

The discrete values for the Employment Status attribute include 'Self Employed', which is the majority; it is unlikely that 1013 (50.65%) of 2000 people are self-employed, however there is no evidence to disprove this figure.

The remaining 3 of the 4 discrete values for this attribute indicate that 642 (32.1%) people are Unemployed, 340 (17%) are Employed, and 5 (0.25%) are Retired.

The default retirement age (formerly 65) has now been phased out, allowing people to work as long as desired. However, for this dataset, the former retirement age of 65 has been used as the reference. Only 5 people in the entire dataset are retired, none of whom are over the age of 65.

The 5 customers with an Employment Status of Retired are considered to be outliers, and should remain in the dataset. This data will be relevant in determining the percentage of retired people who have previously paid debts, and who have previously defaulted. However, the amount of data is considerably small, and should be treated as such (there are limitations on its reliability).
2.7 Country
Nominal

There are four distinct values for the Country attribute: Germany, France, Spain and UK. The majority of values (1994) are UK, leaving a very small count of 6 values (2 each) for the other three countries.

Due to the extremely small number of records for countries other than the UK, it is impossible to retrieve a reliable result for Germany, France and Spain. Therefore, the only possible use for the Country attribute would be to find a common trend for customers in the UK; however, it is more than likely that the Country attribute will not be used.

The histogram (left) indicates that the non-UK values for the Country attribute are so few that a valid representation of Paid and Defaulted outcomes cannot be shown. This is the justification for their removal in the final dataset: a reliable trend in results cannot be obtained.
Employment Status | Self Employed | Unemployed | Employed | Retired
Count | 1013 | 642 | 340 | 5
Percentage | 50.65% | 32.1% | 17% | 0.25%

Country | Germany | France | Spain | UK
Count | 2 | 2 | 2 | 1994
Percentage | 0.1% | 0.1% | 0.1% | 99.7%
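The removal of the six non-UK records justified above amounts to a simple filter; the rows below are illustrative stand-ins for the real dataset.

```python
# Illustrative rows; the real dataset has 1994 UK records and 6 others.
rows = [
    {"customer_id": 1, "country": "UK"},
    {"customer_id": 2, "country": "Germany"},
    {"customer_id": 3, "country": "UK"},
    {"customer_id": 4, "country": "Spain"},
]

# Keep only UK records, since the non-UK counts are too small to be reliable.
uk_only = [r for r in rows if r["country"] == "UK"]
print(len(uk_only))  # 2
```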
2.8 Current Debt
Numeric

The Current Debt attribute has 788 distinct values, of which 15% (294) are unique. Current debt is continuous, as there is no finite amount of debt a person can have. The higher the debt amount, the harder it will be for a person to take out new loans and to repay current loans.

It can therefore be hypothesised that a person with a high current debt, and an income which is less than that debt, will not repay future loans; however, this is still to be proven. This attribute will be useful in predicting the probability that a person will repay future loans, as it provides a statistical figure for past debts. Debt can also be interpreted as indicating a person who struggles with money, which is why a loan is needed in the first place.
2.9 Postcode
Nominal

The Postcode attribute has 1971 distinct values and 0% missing values. This confirms that some customers live in the same area (share a postcode), where trends may develop. However, there are too few shared postcodes to create any significant trends, and much like the gender attribute, discrimination based on living location is not permitted.

The usefulness of this attribute is therefore nil, and a solution should not use it when predicting the probability of future loan repayments. This attribute may be used in other circumstances outside the given scenario, e.g. for marketing purposes.
Current Debt         Unedited Dataset
Minimum Value        0
Maximum Value        9980
Mean                 3309.325
Standard Deviation   2980.629
2.10 Income
Numeric

The Income attribute has 100 distinct values, of which 7 are unique (rounded to 0%). This is a continuous amount, as there is no finite amount of income a person can have, due to varying salaries and other sources, e.g. commission.

This attribute has great importance, especially when evaluated against Current Debt. It can be hypothesised that for a person with a low income, e.g. 3000, who has a similar amount or more in current debt, it is unlikely that future loans will be repaid.

Trends may also be established using multiple attributes, e.g. the higher a person's income, the less current debt they have, except for people with an income of £30,000 - £40,000. This is purely an example, and trends will be examined in the following chapters.
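The debt-versus-income hypothesis above can be sketched as a simple flag, here with pandas on hypothetical records (the column names and values are assumptions for illustration, not taken from the actual dataset):

```python
import pandas as pd

# Hypothetical customers; columns and figures assumed, not real dataset rows.
customers = pd.DataFrame({
    "Income":      [3000, 45000, 28000],
    "CurrentDebt": [3500,  1200, 30000],
})

# Flag customers whose current debt is similar to or higher than their income,
# per the hypothesis that such customers are unlikely to repay future loans.
customers["HighRisk"] = customers["CurrentDebt"] >= customers["Income"]
print(customers["HighRisk"].tolist())  # → [True, False, True]
```

Such a derived flag could then be examined against the Outcome attribute to test the hypothesis.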
The scatter plot (right) represents loan amount against income. The 2 outliers are shown, as they vary significantly from the average loan amounts; however, there are only 2 such values, of which 1 paid and 1 defaulted.
2.11 Own Home
Nominal

The Own Home attribute has 3 distinct values and may provide valuable data on whether future loan repayments will be made or not. This attribute in conjunction with another attribute, e.g. Income, may provide some solid statistics. Example: people who own their home and have an income greater than £30,000 are likely to repay their loans; this is a hypothesis, however, and will be examined in a later chapter.
The hypothesis stated above, in conjunction with logical reasoning, suggests that applicants who own their home and have a comfortable income can likely pay their mortgage and repay future loans with ease, whereas an applicant who rents their home may struggle, due to the outgoing payments each month.

There is no noise for this attribute, as there are only 3 possibilities for home ownership, and there are 0% missing values.
Income               Unedited Dataset
Minimum Value        3000
Maximum Value        220000
Mean                 38319
Standard Deviation   12786.506

Own Home     Rent    Own      Mortgage
Count        292     1031     677
Percentage   14.6%   51.55%   33.85%
2.12 CCJs
Numeric

The CCJs (County Court Judgement) attribute for this dataset has 6 distinct values, ranging from 0 to 100. A CCJ may be issued if court action has been taken against a person, claiming that the person owes money. A CCJ will be issued if the court has formally decided a person owes money; it will state how much money is owed, and the deadline for payment.

CCJs suggest a person is unreliable in repaying their debts, whereby a higher number indicates a greater risk for loan repayments, making it difficult for a person to secure loans or credit. Therefore this attribute will be of high value in estimating the probability of future loans being repaid by customers.
This attribute has 6 distinct values, shown below. The record listing a customer with 100 CCJs is most likely the result of a data error (noise): the likelihood of a person aged 27 having 100 CCJs is low. This record could be altered to 1 or 10, or deleted, as the actual value cannot be determined for definite, and the removal of 1 record is insignificant. The same can be said for the person with 10 CCJs, as this is the only other unique CCJ value, whereas the remaining CCJ values have many instances.
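The rule described above — dropping the single-instance CCJ values of 10 and 100 while keeping the well-populated values 0 - 3 — could be sketched as follows (a minimal illustration, not the actual cleaning script used):

```python
import pandas as pd

# CCJ values reconstructed from the counts in the table below.
ccjs = pd.Series([0] * 886 + [1] * 497 + [2] * 347 + [3] * 268 + [10, 100])

# Keep only values with many instances (0-3); treat 10 and 100 as noise.
cleaned = ccjs[ccjs <= 3]
print(len(ccjs) - len(cleaned))  # → 2 records removed
```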
The histogram (left) shows that customers with lower CCJ counts (0 - 1) mostly defaulted (red), whereas with an increase in the number of CCJs (2 - 3), the majority of customers actually paid (blue).

The scatter plot (right) represents the number of CCJs against the outcome of either paid or defaulted. The noise can clearly be seen here: the customer who had 100 CCJs paid, which is clearly an error in data entry; 100 CCJs is near impossible.

As the '100' value deviates significantly from the other recorded values, which consistently fall between 0 - 3, it might be classed as an outlier if the scenario were different, e.g. the IQ of a scientifically recognised figure measured against a group of primary school children. However, as the scenario relates solely to the number of CCJs, the probability of the 27 year old customer having 100 CCJs is extremely low, and the value has been treated as noise.
CCJs         0       1        2        3       10      100
Count        886     497      347      268     1       1
Percentage   44.3%   24.85%   17.35%   13.4%   0.05%   0.05%
2.13 Loan Amount
Numeric

Loan Amount is continuous, as there is no finite amount a person can borrow, unless they are declined loans by the bank or another financial establishment.

The minimum value present is 13, followed by numerous similar values, e.g. 35, 45, 50. These values are considered noise, as it is unlikely a bank would loan amounts as small as £13. If the value of 13 is an input error, the actual amount is uncertain, as the value could have been £1,300 or £13,000.

The cut-off line for the removal of these values can only be estimated, based on the financial standard whereby most established bank loans begin at £1,000. The alternative, however, is that people may take out payday loans from third-party websites, such as Wonga and Kwik Cash, whereby these small loans are valid amounts.
2.14 Outcome
Nominal

The Outcome attribute has 2 distinct values, 'Paid' or 'Defaulted', which provides the basis for assessing the ability to repay future loans. The Outcome attribute records whether the person had paid or defaulted on their loan at the time of data collection, in the context of their income, loan amount and current debt.

The number of people who paid and defaulted, in both the unedited dataset and the dataset after manipulation (removal of noise), can be seen in the table below. With the manipulated dataset, the counts of people who paid and who defaulted have each been reduced by 3, leaving consistent outcome values with a cleaner dataset.
Loan Amount          Value
Minimum Value        13
Maximum Value        54455
Mean                 18929.628
Standard Deviation   12853.189

Outcome     Unedited Dataset   Edited Dataset
Paid        1118               1115
Defaulted   882                879
3.0 Data Mining Technique Selection

3.1 Decision Trees
General

A decision tree is represented by a node and leaf structure: each attribute is represented by a node, which branches multiple times into further attributes, until a final decision has been made (a leaf). A common method of producing decision trees is TDIDT.

First Phase

The TDIDT method (top-down induction of decision trees) has two phases, the first of which is to build or 'grow' the decision tree. This is done by determining the 'best' attribute in the provided dataset, which becomes the root node of the decision tree. The best attribute is the one that partitions the class attribute most purely. Ideally an attribute that produces a pure node will be used; however, if this is not possible, the purest available attribute will be used. A pure node indicates that all samples of that node have the same class label, and there is no need to split the node further into additional branches.
Second Phase

The second phase is to 'prune' the tree, which is achieved using a bottom-up method, in contrast to the top-down method of the first phase. Any data that is deemed to be noise (this may be uncertain, but logical deduction should be applied here) should be pruned; this removes the node from the decision tree, creating a leaf. Essentially, pruning reduces classification errors in the presence of noisy data.

Ross Quinlan developed the first classification decision tree algorithm, known as ID3 (Iterative Dichotomiser 3), in 1986, which was later developed into a more general algorithm called C4.5 (1993) that can use continuous attributes in addition to discrete ones.
C4.5 Algorithm

The C4.5 algorithm generates decision trees for classification problems from training sets, and can deal with both continuous and discrete attributes, in addition to missing values. The decision tree is pruned after creation, replacing irrelevant branches with leaf nodes.

The dataset for the given financial establishment includes both continuous and discrete attributes, which this algorithm caters for. The method works well with the noisy data the dataset contains, which may therefore not have to be removed, due to the limited effect it would have on the end result.

The algorithm can be enhanced post-completion using the pruning method described above, and different test options used to 'train the data' also provide varying outlooks on the dataset, e.g. cross-validation, percentage split and training sets. A training set is a section of a dataset used with a specific model to discover potentially predictive relationships among known values of a dataset, for future unknown data [1].
Pros and Cons of Decision Trees

Decision trees are usually easy to interpret visually, which is advantageous; however, on large sets of data, decision trees can become large, even after pruning has been completed. Over-complex decision trees may therefore be generated, making data interpretation difficult and the model poorly generalised.

Pruning means that very little data preparation is required. Noise can generally be dealt with via pruning, and is not critical to the overall accuracy of the decision tree generated. A white box model is used, as the data is known to the tester, and results can be explained using logic.

Decision trees are usually a very fast method of data mining, and can classify unknown records with little effort [9].
Entropy and Information Gain

Entropy is a common way to measure impurity; a perfect 50-50 split has the maximum entropy of 1, which provides good training data. Information gain is based on the decrease in entropy (or impurity) after a dataset has been split on an attribute; the split chosen is the one that produces the most homogeneous nodes, providing the purest results. The formula below measures the reduction in entropy achieved by a given split [3]:

Entropy(S) = -Σ p(c) log2 p(c), summed over each class c
Gain(S, A) = Entropy(S) - Σ (|Sv| / |S|) Entropy(Sv), summed over each value v of attribute A
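As a concrete illustration of these two measures, a minimal sketch (the helper functions and labels are my own, not part of the coursework):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction from splitting `labels` into `groups`."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

# A perfect 50-50 split has maximal impurity (entropy = 1.0) ...
print(entropy(["Paid", "Defaulted", "Paid", "Defaulted"]))  # → 1.0
# ... and splitting it into two pure groups recovers all of that entropy.
print(information_gain(["Paid", "Defaulted", "Paid", "Defaulted"],
                       [["Paid", "Paid"], ["Defaulted", "Defaulted"]]))  # → 1.0
```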
Pseudo-Code for C4.5

The pseudo-code for the C4.5 algorithm, operating on a set S of examples, is shown below; it has been referenced from [6].

1. Check for the base case.
2. Find the attribute with the highest information gain (A_best).
3. Partition S into S1, S2, S3, ... according to the values of A_best.
4. Repeat the steps for each of S1, S2, S3, ...

Overfitting

Decision trees grown until each leaf node has the lowest possible impurity may overfit, whereby the algorithm reduces the training set error at the cost of an increased test set error. Overfitting generally occurs in complex models with many parameters, where noise may be described instead of legitimate relationships. Post-pruning effectively solves this issue by allowing the tree to classify the training set first, after which the nodes are pruned [8, 9].
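The pseudo-code above can be sketched in Python as a minimal ID3-style tree builder over nominal attributes; pruning, continuous attributes and missing-value handling (the C4.5 extensions) are omitted, and the toy rows and attribute names are assumptions for illustration only:

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    """Shannon entropy of the target labels in `rows`."""
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def build_tree(rows, attributes, target):
    """Minimal TDIDT sketch: base case, best attribute, partition, recurse."""
    labels = {r[target] for r in rows}
    if len(labels) == 1:                    # base case: pure node becomes a leaf
        return labels.pop()
    if not attributes:                      # no attributes left: majority leaf
        return Counter(r[target] for r in rows).most_common(1)[0][0]

    def gain(attr):                         # information gain of splitting on attr
        parts = {}
        for r in rows:
            parts.setdefault(r[attr], []).append(r)
        return entropy(rows, target) - sum(
            len(p) / len(rows) * entropy(p, target) for p in parts.values())

    best = max(attributes, key=gain)        # A_best: highest information gain
    partitions = {}                         # partition S into S1, S2, ... by A_best
    for r in rows:
        partitions.setdefault(r[best], []).append(r)
    remaining = [a for a in attributes if a != best]
    return (best, {v: build_tree(s, remaining, target)
                   for v, s in partitions.items()})

# Toy rows with assumed attribute names; not the coursework dataset itself.
rows = [
    {"Own Home": "Own",  "CCJs": "0", "Outcome": "Paid"},
    {"Own Home": "Own",  "CCJs": "2", "Outcome": "Paid"},
    {"Own Home": "Rent", "CCJs": "0", "Outcome": "Paid"},
    {"Own Home": "Rent", "CCJs": "2", "Outcome": "Defaulted"},
]
tree = build_tree(rows, ["Own Home", "CCJs"], "Outcome")
print(tree)
```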
3.2 Naïve Bayes
General

'Bayesian' refers to Reverend Thomas Bayes and his solution to a problem in the theory of probability, known as Bayes' Theorem, published in 1763. Naive Bayes can be used to understand how the probability that a theory is true is affected by a new piece of evidence [2].

The Naive Bayes model has been applied in real-world scenarios, e.g. naive Bayes model averaging to predict Alzheimer's disease from genome-wide data [5]. A significant number of records (1411) made use of the classifier's strengths in dealing with large amounts of data.
Naive Bayes assumes conditional independence:
‘Naive Bayes makes the assumption that each predictor is conditionally independent of the
others. For a given target value, the distribution of each predictor is independent of the other
predictors. In practice, this assumption of independence, even when violated, does not
degrade the model's predictive accuracy significantly, and makes the difference between a
fast, computationally feasible algorithm and an intractable one.’ [4]!
The goal of Naive Bayes is to work out whether a new example is in a class given that it
has a certain combination of attribute values. We work out the likelihood of the example
being in each class given the evidence (its attribute values), and take the highest
likelihood as the classification [3].!
The conditional probability of an event B in relation to an event A is the probability that event B occurs given that event A has already occurred, written P(B|A) and read as 'the probability of B given A':

P(B|A) = P(A and B) / P(A)

To calculate the probability of B given A, the algorithm counts the number of cases where A and B occur together and divides it by the number of cases where A occurs alone [4].
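That counting procedure can be shown in a few lines; the (A, B) observations below are hypothetical, chosen only to illustrate the calculation:

```python
# Hypothetical (A, B) observations: (Own Home, Outcome) pairs.
records = [("Own", "Paid"), ("Own", "Paid"), ("Own", "Defaulted"),
           ("Rent", "Paid"), ("Rent", "Defaulted"), ("Rent", "Defaulted")]

def conditional_probability(records, a, b):
    """P(B|A): count of cases where A and B occur together,
    divided by the count of cases where A occurs."""
    a_cases = [r for r in records if r[0] == a]
    return sum(1 for r in a_cases if r[1] == b) / len(a_cases)

# P(Paid | Own) = 2 of the 3 'Own' cases, i.e. 2/3
print(conditional_probability(records, "Own", "Paid"))
```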
Pros and Cons of Naive Bayes

Naive Bayes' assumption that variables are all equally important and independent of one another is often untrue. Redundant attributes that are included also hinder Naive Bayes. Strong dependencies between attributes, e.g. a person with a high income tending to own an expensive house, unfairly multiply the effect of one piece of evidence. Attribute values will be assigned a zero probability if they are not present in the data. Small positive values, which estimate the so-called 'prior probabilities', are often used to correct this.

Naive Bayes often provides practical results, and the algorithm is somewhat robust to noise and irrelevant attributes. A large number of records must be present to achieve the best results from this classifier [10].
The “zero-frequency problem”

The solution is to never allow a zero probability. If an attribute value does not occur with a given class value, e.g. 'Humidity = high' for class 'yes', then Pr[Humidity = high | yes] = 0, and the probability for that class will be zero regardless of how likely the other values are.

The fix is to add 1 to the count for every attribute value-class combination (the Laplace estimator); this keeps a consistent set of values whilst never allowing a zero value.
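A minimal sketch of the Laplace estimator; the humidity counts are the hypothetical example above, not values from the dataset:

```python
from collections import Counter

# Hypothetical counts of 'Humidity' values among records of class 'yes'.
humidity_given_yes = Counter({"high": 0, "normal": 9})

def probability(counts, value, laplace=True):
    """Estimate Pr[value | class]; the Laplace estimator adds 1 to the
    count of every attribute value-class combination."""
    add = 1 if laplace else 0
    total = sum(counts.values()) + add * len(counts)
    return (counts[value] + add) / total

print(probability(humidity_given_yes, "high", laplace=False))  # → 0.0
print(probability(humidity_given_yes, "high"))                 # → 1/11 ≈ 0.091
```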
4.0 Data Exploration and Preparation

4.1 Relevant Data

As discussed in chapter 2, the attributes that will be used are Age, Years at Address, Employment Status, Current Debt, Income, Own Home, CCJs, Loan Amount and Outcome. The attributes that have not been included in data preparation are therefore Customer ID, Forename, Surname, Gender, Country and Postcode.

Previous discussions in chapter 2 describe the discrimination concerns surrounding a person's gender and home location, which is why these attributes are no longer used to evaluate whether an applicant should be eligible for future loans. This reasoning covers Postcode, while the Country attribute has so few records outside the UK that reliable trends cannot be established.
4.2 Noise

The total number of records after the elimination of noisy data is 1994; 6 records have been deleted from the dataset.

Years at Address and CCJs

4 of the removed records are due to 2 attributes assumed to be incorrect: CCJs and Years at Address. The values recorded within these attributes are almost certainly noise, e.g. 560 for Years at Address and 100 for CCJs. The Years at Address figures could easily have been misinterpreted and entered as months instead of years, and CCJs should not reach such extensive figures. It was therefore easier to label these records as noise and remove them, as the large dataset accommodates the removal of very few records.

The histograms above are created from the edited dataset (after removal of noise) on the left, and the unedited dataset on the right. It is apparent that the removed CCJ and Years at Address values are noise and not outliers, and hold no significance.
Gender

Initial cleanup of the data removed noise values, leaving valid F and M genders. The edited dataset histogram is on the left; the unedited dataset is on the right. The removed values have been labelled as noise, as they hold very little significance: 1 and 0 may represent M and F, but this was uncertain, so logical deduction categorised these uncertain values as M and F respectively, based on the Forename of the customers, as this is the only attribute that may identify the relevant gender.

There are consistent errors within the Forename, Surname and Gender attributes, as different formats are used, e.g. first initial with complete surname, complete name, or irrelevant punctuation found within names. Gender also contains clearly incorrect values, e.g. a customer named David classed as 'F' for female.
4.3 Outliers

Interesting Attributes (Income and Country)

The Income attribute may hold some interesting information, as its outliers potentially include values of 180000 and 220000. It may be possible to use this attribute to help deduce whether a customer will default on or pay future loans.

Age: 17 | Income: 220000 | CCJs: 1 | Outcome: Defaulted
Age: 42 | Income: 180000 | CCJs: 0 | Outcome: Paid

Assumptions may include that a young person with a high income and previous CCJs is likely to default on future loan repayments, whereas an older person with a high income and no previous CCJs is likely to repay future loans.
The histograms to the right represent unchanged data in the edited dataset, as the outliers have not been deleted; however, the Country attribute has lost 6 records due to the removal of noise in the CCJ and Years at Address attributes.

The 6 non-UK country values will not be used, as there are not enough records to represent any accurate country-dependent trends.
5.0 Modelling

Experiment One
Investigate Training and Testing Strategies

Aim: to investigate whether a 50:50 percentage split or cross-validation gives the best classification accuracy when using Naive Bayes and J48.

Methodologies

The two data mining techniques chosen in chapter 3 will be modified via individual parameters to produce the purest nodes, and therefore the most accurate results (where the percentage of correctly classified instances is highest). WEKA's implementation of the C4.5 decision tree algorithm is called J48, which will be tested in addition to Naive Bayes.

The initial experiment will determine whether a 50:50 split or a cross-validation training method produces the highest percentage of correctly classified instances.
50:50 Percentage Split

This testing method splits the dataset into sections based on the percentage chosen: the algorithm learns rules from the training data, then applies those rules to the testing data, which produces the results to be used.

A 50:50 split seems the most reliable, as the entire dataset is evenly split, without bias. A 90:10 split produces a slightly higher classification accuracy; however, the test sample is too small to form a valid conclusion from reliable data. The 50:50 split (although producing a slightly lower classification accuracy) has a reliable sample size, and evenly distributes training and testing data.
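In scikit-learn terms (a stand-in for WEKA's percentage split, run on synthetic data rather than the coursework dataset), the 50:50 split looks like this:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 2000-record stand-in for the loan dataset.
X, y = make_classification(n_samples=2000, random_state=0)

# 50:50 split: learn rules on one half, measure accuracy on the held-out half.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(len(X_train), len(X_test), model.score(X_test, y_test))
```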
Cross-Validation (10 fold)

This testing method splits the dataset into (my desired number of folds) 10 equal sections; for each fold, the algorithm learns its 'rules' from the other nine sections and applies them to that held-out fold.

'Extensive tests on numerous different datasets, with different learning techniques, have shown that 10 is about the right number of folds to get the best estimate of error.' [7]

If the dataset contained the full 2000 records, then with 10 folds each iteration divides the data into two groups: 1800 records used for training and 200 records used for testing.

A classifier is produced by the algorithm from the 1800 records and applied to the 200 test records of each fold. A classifier is produced for each fold, and the average number of cases predicted incorrectly is used, producing the classification accuracy percentage and confusion matrix. The higher the correctly classified instance percentage, the more accurate the test.
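The 10-fold procedure described above can be sketched with scikit-learn (again on synthetic stand-in data, with GaussianNB standing in for WEKA's Naive Bayes):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic 2000-record stand-in for the loan dataset.
X, y = make_classification(n_samples=2000, random_state=0)

# 10-fold CV: each fold trains on 1800 records and tests on the other 200;
# the reported figure is the accuracy averaged over the ten folds.
scores = cross_val_score(GaussianNB(), X, y, cv=10)
print(len(scores), round(scores.mean(), 3))
```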
The results from the 2 testing options will be compared for both J48 and Naive Bayes. I will then continue, for the remainder of the experiments, to use each methodology with the testing option that provided its highest correctly classified instance percentage, e.g. Naive Bayes may be more accurate using cross-validation, whereas J48 may be more accurate using a percentage split.
For the initial test, all parameters for both J48 and Naive Bayes were fixed to their defaults as stated in WEKA; the screenshots below indicate the parameter defaults. The parameters will be changed individually in later experiments to determine the optimal classifier, by enhancing (increasing) the correctly classified instance percentage.

The table below indicates the training and testing methods used with J48 and Naive Bayes, and their classification accuracy. The percentage of correctly classified records for the paid and defaulted attributes is also shown. Green / red cells indicate the training and testing methods that will be used or ignored (respectively) in the other experiments.

For the following experiments concerning J48, only the 50:50 percentage split training method will be used, as this provided the highest classification accuracy percentage. Similarly, the Naive Bayes experiments will continue to use the 50:50 percentage split training method, as this too provided the highest classification accuracy percentage.
Model   Training / Testing   Paid (%)   Default (%)   Classification Accuracy (%)
J48     50:50                88.13      70.04         80.44
J48     CV                   84.39      72.01         78.93
NB      50:50                90.99      64.85         76.62
NB      CV                   83.58      63.48         74.72
Decision Tree

The decision tree in its entirety was too large to show on screen; the image to the left shows a section of it. The decision tree shown relates to the J48 algorithm, using the percentage split training method.
Confusion Matrix

The confusion matrices below are labelled according to the data mining techniques chosen and the training methods used. The confusion matrices (right) generate the following percentages of correctly classified paid and defaulted records (indicated by P and D respectively):

On average, J48 (percentage split) has the highest accuracy of correctly classified paid and defaulted records, with an averaged figure of 79%. This figure is reliable, as it is extremely close to the classification accuracy of the model (shown on the previous page).
J48 PS [D = 70%] [P = 88%]
J48 CV [D = 72%] [P = 84%]
NB PS [D = 64%] [P = 85%]
NB CV [D = 63%] [P = 83%]
Parameters

The two images indicate the parameters that can be changed (one at a time) for each classifier, as will be performed in experiments 2, 3, 4 and 5. J48 (right) will have the binarySplits and unpruned parameters changed. Naive Bayes (left) will have the useKernelEstimator and useSupervisedDiscretisation parameters changed.
Parameters were changed one at a time, and the confusion matrix and correctly classified instances percentage were noted. Each parameter was then changed back to its default setting, allowing another parameter to be changed. This ensured greater reliability: if multiple parameters were changed at the same time, it would be difficult to determine which parameter had the most noticeable effect on the overall outcome.
Experiment Two (J48)
Investigate the Effectiveness of Pruning

Aim: to investigate whether using pruning gives a better classification accuracy.

Methodology

Change Unpruned to true, to determine how the correct classification percentage changes.

Parameter

By default, the 'Unpruned' parameter is set to false, indicating the decision tree is pruned. This experiment changes the parameter's value to true, to investigate whether an unpruned tree will give a higher classification accuracy percentage.

The decision tree generated using the J48 algorithm with the 50:50 percentage split training method has a higher classification accuracy when pruning is present. Un-pruning the tree drops the accuracy by 2.61%, from 80.44 to a final figure of 77.83.
The confusion matrix above would be considered highly accurate if the company cared mainly about customers who have paid their loans; however, the correctly classified accuracy percentage for defaulted records is quite low.

D = 64% P = 87%

The unpruned decision tree increases significantly in size, as shown (left), whereby Employment Status shows each of the 4 possible attribute values.

This form of decision tree would have greater value if the company decided to drill down into specific attributes held by customers, for marketing purposes. E.g. if the company wishes to view records of customers who are retired and have a loan amount greater than a set figure, an unpruned tree is the best option.
Experiment Three (J48)
Investigate the Effectiveness of Binary Splits

Aim: to investigate whether using binary splits gives a better classification accuracy.

Methodology

Change Binary Splits to true, to determine the change in correct classification percentage.

Parameter

By default the Binary Splits parameter is set to false; setting this parameter to true will use binary splits on nominal attributes when building the decision tree.

The decision tree generated using the J48 algorithm with the 50:50 percentage split training method has a higher classification accuracy when binary splits are not used. Using binary splits drops the accuracy by 0.71%, from 80.44 to a final figure of 79.73.
Binary splits also generate a high accuracy percentage for the paid attribute; however, the defaulted accuracy is still quite low, higher than with the pruning parameter yet lower than with the default parameter settings.

D = 69% P = 87%

The tree has been left pruned, as the unpruned tree generated a lower correctly classified instance percentage. As that accuracy was less reliable, the pruning parameter was reverted before changing the second parameter; enabling binary splits generated a higher correctly classified instance percentage than unpruning, although still lower than the complete default set of parameters.

However, when binary splits are enabled the decision tree is much more readable. The specifics of attributes are not as visible, e.g. for the Employment Status attribute only the employed and not-employed branches are visible, rather than the retired and self employed values.
Experiment Four (Naive Bayes)
Investigate the Effectiveness of the Kernel Estimator

Aim: to investigate whether using the kernel estimator gives a better classification accuracy.

Methodology

Change useKernelEstimator to true, to determine how the correctly classified instances percentage changes. If the change positively affects the accuracy it will be kept; a negative effect will ensure the parameter change is reverted.

Parameter

By default the Kernel Estimator parameter is set to false; setting this parameter to true will use a kernel estimator for numeric attributes rather than a normal distribution.

The Naive Bayes classifier using the 50:50 percentage split training method has a higher classification accuracy when the kernel estimator is not used. Using the kernel estimator drops the accuracy by 0.8%, from 76.62 to a final figure of 75.82.
The kernel estimator compensates for numeric attributes in the dataset which are not normally distributed, attempting to smooth the data.

Using the kernel estimator lowered the classifier accuracy percentage for both paid and defaulted values, and should therefore be avoided, as enabling this parameter provided no benefit.

D = 64% P = 84%
Workings

Defaulted: 273 / (273 + 151) = 0.64
Paid: 483 / (483 + 90) = 0.84

Averaging the two figures: (0.64 + 0.84) / 2 = 0.74, i.e. 74%.

This figure is close to the correctly classified instance accuracy of 75.82%.
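The workings can be reproduced directly from the confusion matrix counts:

```python
# Counts from the kernel-estimator confusion matrix quoted above.
defaulted = 273 / (273 + 151)   # correctly classified defaulted records
paid = 483 / (483 + 90)         # correctly classified paid records
average = (defaulted + paid) / 2
print(round(defaulted, 2), round(paid, 2), round(average, 2))  # → 0.64 0.84 0.74
```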
Experiment Five (Naive Bayes)
Investigate the Effectiveness of Supervised Discretisation

Aim: to investigate whether using supervised discretisation gives a better classification accuracy.

Methodology

Change the Supervised Discretisation parameter to true, to determine whether this positively affects the correctly classified instance percentage. If it does, the parameter change will be kept; if it negatively affects the accuracy, the change will be reverted.

Parameter

By default the Supervised Discretisation parameter is set to false; setting this parameter to true will use supervised discretisation to convert numeric attributes to nominal ones.

The Naive Bayes classifier using the 50:50 percentage split training method has a higher classification accuracy when supervised discretisation is used. Enabling supervised discretisation raises the accuracy by 1.21%, from 76.62 to a final figure of 77.83.
Enabling supervised discretisation is the only parameter change in these experiments that positively impacted the classification accuracy for both the Paid and Defaulted attributes. The following percentages are for Naive Bayes with all default parameters, except Supervised Discretisation:

D = 65% P = 86%

Enabling the parameter saw an increase of 1% accuracy for both the paid and defaulted attributes, up from 64% and 85% respectively.
6.0 Conclusion

The table below conclusively shows that J48, with a 50:50 percentage split and default parameters, correctly classified the most instances of the paid and defaulted attributes. J48 is the best classifier all round, as no matter which parameters were changed, the outcome favoured the J48 classifier over Naive Bayes in terms of accuracy.

The only experiment and parameter change that positively affected either of the classifiers was the enabling of supervised discretisation for Naive Bayes. Enabling this parameter increased the accuracy of both paid and defaulted instances by 1%. However, this still lagged behind the J48 classifier, which correctly classified paid instances 2% more often, and defaulted instances 5% more often.
6.1 Final Results
If the financial establishment were interested in specific information for marketing purposes, or in specialised events that may only concern a niche group of customers, J48 binary splits may be the most beneficial, as they generate a more readable (small-scale) decision tree that categorises attributes more aggressively.
The highest Paid and Defaulted classification accuracy percentages were both achieved using the J48 classifier with the 50:50 percentage split training method and the default parameters. Parameter alterations only negatively affected the classification accuracy, and therefore should not be used.
If the financial establishment were solely interested in defaulted customers, the model with the highest accuracy on the defaulted attribute should be chosen, regardless of how low its accuracy on the paid attribute is, e.g. D = 84%, P = 62%.
However, if the financial establishment wished to achieve reliable figures for both paid and defaulted attributes, a more evenly matched set of figures should be aimed for, e.g. D = 75%, P = 78%.
In either situation, the J48 classifier with default parameters provides the most reliable overall results.
Experiment                  | J48 Paid % | Naive Bayes Paid % | J48 Defaulted % | Naive Bayes Defaulted %
Default                     | 88         | 85                 | 70              | 64
Pruning                     | 87         | -                  | 64              | -
Binary Splits               | 87         | -                  | 69              | -
Kernel Estimator            | -          | 84                 | -               | 64
Supervised Discretisation   | -          | 86                 | -               | 65

All figures use the 50:50 percentage split (PS) training method; "-" marks parameters that do not apply to that classifier.
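For reference, the 50:50 percentage-split evaluation behind these per-class figures can be sketched as follows. The majority-class predictor is a stand-in for J48 or Naive Bayes, which are not reimplemented here, and the customer data is invented; only the evaluation harness itself is the point:

```python
import random

# Sketch of a 50:50 percentage-split evaluation with per-class accuracy,
# as reported for the Paid and Defaulted attributes above.

def percentage_split(rows, train_fraction=0.5, seed=1):
    """Shuffle the data and split it into train/test portions."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def per_class_accuracy(test_rows, predict):
    """Fraction of correctly classified instances, per class label."""
    correct, total = {}, {}
    for features, label in test_rows:
        total[label] = total.get(label, 0) + 1
        if predict(features) == label:
            correct[label] = correct.get(label, 0) + 1
    return {lbl: correct.get(lbl, 0) / total[lbl] for lbl in total}

# Toy customer data: one numeric feature, "P" (paid) or "D" (defaulted).
data = [((i,), "P" if i % 3 else "D") for i in range(100)]
train, test = percentage_split(data)

# Baseline "classifier": always predict the training set's majority class.
train_labels = [lbl for _, lbl in train]
majority = max(set(train_labels), key=train_labels.count)
res = per_class_accuracy(test, lambda features: majority)
print(res)
```

Reporting accuracy per class, rather than overall, is what exposes the Paid/Defaulted imbalance discussed in the final results.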