International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.5, No.1, January 2015
DOI: 10.5121/ijdkp.2015.5103
A PREDICTIVE SYSTEM FOR DETECTION OF BANKRUPTCY USING MACHINE LEARNING TECHNIQUES
Kalyan Nagaraj and Amulyashree Sridhar
PES Institute of Technology, India
ABSTRACT
Bankruptcy is a legal procedure that declares a person or organization a debtor unable to repay its creditors. It is essential to ascertain the risk of bankruptcy at an early stage to prevent financial losses. To this end, different soft computing techniques can be employed to ascertain bankruptcy. This study proposes a bankruptcy prediction system that categorizes companies based on the extent of risk. The prediction system acts as a decision support tool for the detection of bankruptcy.
KEYWORDS
Bankruptcy, soft computing, decision support tool
1. INTRODUCTION
Bankruptcy is a situation in which a firm is incapable of resolving its monetary obligations, exposing it to legal action. The financial assets of the company are sold off to clear the debt, which results in huge financial losses to investors. Bankruptcy results in decreased liquidity of capital and diminished financial prospects. World Bank data reports that the Indian government resolves insolvency in an average of 4.3 years [1]. There is a need to design effective strategies for predicting bankruptcy at an early stage to avoid financial crises. Bankruptcy can be predicted using mathematical techniques, hypothetical models, and soft computing techniques [2]. Mathematical techniques are the primary methods used for estimating bankruptcy from financial ratios; they are based on single- or multi-variable models. Hypothetical models are developed to support theoretical principles, but they are statistically complex because of their assumptions. Hence soft computing techniques are extensively used for developing predictive models in finance. Popular soft computing techniques include Bayesian networks, logistic regression, decision trees, support vector machines and neural networks.

In this study, different machine learning techniques are employed to predict bankruptcy. Based on the performance of the classifiers, the best model is chosen for the development of a decision support system in the R programming language. The support system can be used by stockholders and investors to predict the performance of a company based on the nature of the associated risk.
2. BACKGROUND
Several studies have been conducted in the recent past reflecting the importance of machine
learning techniques in predictive modelling. The studies and the technologies implemented are
briefly discussed below.
2.1. MACHINE LEARNING
Machine learning techniques are employed to explore the hidden patterns in data by developing models. The overall process is broadly referred to as knowledge discovery in databases (KDD). Different learning algorithms are implemented to extract patterns from data. These algorithms can be either supervised or unsupervised: supervised learning is applied when the output of a function is previously known, and unsupervised learning when the target function is unknown. The general layout of the machine learning process is described below:

Data collection: The data related to the domain of concern is extracted from public platforms and data warehouses. The data is typically raw and in an unstructured format, hence pre-processing measures must be adopted.

Data pre-processing: The initial dataset is subjected to pre-processing, which removes outliers and redundant data. Missing values are handled through normalization and transformation.

Development of models: The pre-processed data is subjected to different machine learning algorithms for the development of models. The models are constructed based on classification, clustering, pattern recognition and association rules.

Knowledge extraction: The models are evaluated to represent the knowledge captured. The knowledge attained can be used for a better decision-making process [3].
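As a rough illustration of these four stages, the sketch below traces them in R (the language used later in this paper for the decision support system). The file name "data.csv" and the column name "Class" are hypothetical placeholders, not artifacts of this study.

```r
# Minimal sketch of the four KDD stages; file and column names are placeholders.
raw   <- read.csv("data.csv", stringsAsFactors = TRUE)       # data collection
clean <- na.omit(raw)                                        # pre-processing: drop rows with missing values
model <- glm(Class ~ ., data = clean, family = binomial)    # model development
summary(model)                                               # knowledge extraction: inspect fitted coefficients
```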
2.2. CLASSIFICATION ALGORITHMS
Several classification algorithms have been implemented in the recent past for financial applications. They are discussed briefly below:
Logistic Regression: It is a classifier that predicts the outcome based on the probabilities of the logistic function. It estimates the relationship between the independent variables and the dependent outcome variable through a probabilistic value, and it may be either a binary or a multinomial classifier. The logistic function is denoted as:

F(x) = 1 / (1 + e^-(β0 + β1x))

where β0 and β1 are coefficients for the input variable x. The value of F(x) ranges from zero to one. The model generated by logistic regression is also called a generalized linear model [4].
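As a minimal sketch, a logistic regression of this form can be fitted in R with the built-in glm() function; the data frames 'train' and 'test' and the binary factor column 'Class' are assumed here for illustration.

```r
# Fit a generalized linear model with the logistic link.
fit <- glm(Class ~ ., data = train, family = binomial)

# Predicted probabilities F(x) = 1 / (1 + exp(-(b0 + b1*x1 + ...))).
prob <- predict(fit, newdata = test, type = "response")

# Classify with a 0.5 probability cutoff.
pred <- ifelse(prob > 0.5, levels(train$Class)[2], levels(train$Class)[1])
```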
Naïve Bayes classifier: It is a probabilistic classifier based on the assumptions of Bayes' theorem [5]. It assumes conditional independence among all the features in the dataset, so each feature contributes independently to the total probability of the model. The classifier is used for supervised learning. The Bayesian probabilistic model is defined as:

p(Ck|x) = p(Ck) p(x|Ck) / p(x)

where
p(Ck|x) = posterior probability
p(Ck) = prior probability
p(x) = marginal probability of x
p(x|Ck) = likelihood of occurrence of x given class Ck
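For illustration, a Naïve Bayes classifier of this kind is available in the R package e1071; the 'train' and 'test' data frames are the same hypothetical objects as above.

```r
# Naive Bayes: posterior p(Ck|x) is proportional to the prior p(Ck)
# multiplied by the per-feature likelihoods p(x|Ck).
library(e1071)

nb   <- naiveBayes(Class ~ ., data = train)
nb$apriori                                # class priors p(Ck) estimated from the data
post <- predict(nb, test, type = "raw")   # posterior probabilities per class
pred <- predict(nb, test)                 # predicted class labels
```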
Random Forest: It is an ensemble classifier that constructs decision trees to build the model and outputs the mode of the individual trees' predictions as the result. The algorithm was developed by Breiman [6]. Classification is performed by selecting a new input vector from the training set. The vector is run down each of the trees in the forest and proximity is computed: if two instances fall in the same terminal node of a tree, their proximity is incremented by one. The proximities are then standardized by the number of trees generated. Random forest algorithms compute the important features in a dataset based on the out-of-bag error estimate. The algorithm also reduces the rate of overfitting observed in decision tree models.
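A brief sketch using the R randomForest package, which implements Breiman's algorithm and exposes the out-of-bag error estimate and variable importance mentioned above:

```r
library(randomForest)

rf <- randomForest(Class ~ ., data = train, ntree = 500, importance = TRUE)
print(rf)                 # includes the out-of-bag (OOB) error estimate
importance(rf)            # per-feature importance scores
pred <- predict(rf, test) # majority vote (mode) over the individual trees
```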
Neural networks: They are learning algorithms inspired by the neurons in the human brain. The network comprises interconnected neurons acting as a function of the input data [7]. Based on the signals received from the input data, weights are generated to compute the output function. The networks can be either feed-forward or feed-back in nature, depending on the directed path of the output function. The error in the output function is minimized by subjecting the network to back-propagation, which optimizes the error rate. The network may also be composed of several layers of units, in which case it is called a multilayer perceptron. Neural networks have immense applications in pattern recognition, speech recognition and financial modeling.
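As one possible illustration, a single-hidden-layer feed-forward network can be trained with the R nnet package; the 'size' and 'decay' values below are arbitrary illustrative choices, not parameters from this study.

```r
library(nnet)

# One hidden layer with 5 units; weights fitted by iterative numerical optimization.
nn <- nnet(Class ~ ., data = train, size = 5, decay = 0.01,
           maxit = 200, trace = FALSE)
pred <- predict(nn, test, type = "class")   # predicted class labels
```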
Support vector machine: It is a supervised learning algorithm that performs non-probabilistic classification of a dataset into categories in a high-dimensional space. The algorithm was proposed by Vapnik [8]. Each training instance is treated as a p-dimensional vector to be separated by a (p-1)-dimensional hyperplane, and the hyperplane achieving the largest separation between data points is considered optimal. The hyperplane function is represented as:

f(x, w, b) = sign(w · x + b)

where
w = normal vector to the hyperplane
x = p-dimensional input vector
b = bias value

The margin separator is defined as 2|k|/||w||, where k represents the number of support vectors generated by the model. The data instances are classified based on the criteria below:

If (w · x + b) = k, the instance is positive.
If (w · x + b) = -k, the instance is negative.
If (w · x + b) = 0, the instance is neutral.
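For illustration, a linear SVM is available in the e1071 package; the decision values it reports correspond to w · x + b, whose sign gives the predicted class.

```r
library(e1071)

sv  <- svm(Class ~ ., data = train, kernel = "linear", cost = 1)
out <- predict(sv, test, decision.values = TRUE)
attr(out, "decision.values")   # w . x + b for each test instance
sv$tot.nSV                     # number of support vectors found
```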
3. RELATED WORK
Detection of bankruptcy is a typical classification problem in machine learning. The development of mathematical and statistical models for bankruptcy prediction was initiated by Beaver in the 1960s [9]. That study focused on univariate analysis of different financial factors to detect bankruptcy. An important development in this arena came from Altman, who developed a multivariate Z-score model of five variables [10]. The Z-score model is considered a standard model for estimating the probability of default in bankruptcy. Logistic regression was also applied to evaluate bankruptcy [11]. These techniques are considered standard
estimates for prediction of financial distress. However, these models carry statistical restrictions that limit their applicability. To overcome these limitations, probit [12] and logit models [13] were implemented for financial applications. In later years, neural networks were applied to estimate distress in financial organizations [14, 15, 16], although neural networks are prone to overfitting, which leads to false predictions. Decision trees have also been applied to predicting financial distress [17, 18], and support vector machines have been employed to predict bankruptcy for financial companies [19, 20]. In recent years, several hybrid models have been adopted to improve on the performance of individual classifiers for the detection of bankruptcy [21, 22].
4. METHODOLOGY
4.1. Collection of Bankruptcy dataset
The qualitative bankruptcy dataset was retrieved from the UCI Machine Learning Repository [23]. The dataset comprises 250 instances described by six attributes. The output has two nominal classes labelling each instance as 'Bankrupt' (107 cases) or 'Non-bankrupt' (143 cases).
4.2. Feature Selection
It is important to remove redundant attributes from the dataset, so a correlation-based feature selection technique was employed. The algorithm selects significant attributes relative to the class value: an attribute is presumed good if it has high correlation with the class variable and minimal correlation with the other attributes of the dataset. Attributes that are highly correlated with other attributes are discarded from the study.
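A sketch of such a filter follows, assuming the nominal attributes have already been recoded numerically (P=1, A=0.5, N=0, as done later in Section 5.2) in an assumed data frame `bank`.

```r
# Correlation-based filter: attribute-class and attribute-attribute correlations.
num_attrs <- as.data.frame(lapply(bank[, 1:6], as.numeric))
class_num <- as.numeric(bank$Class == "B")

# Attribute-class correlations (should be high for a good attribute):
sapply(num_attrs, function(col) cor(col, class_num))

# Attribute-attribute correlations; pairs above the chosen threshold are
# candidates for removal, per the criterion in the text:
cor(num_attrs)
```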
4.3. Implementing machine learning algorithms
The features selected after the correlation analysis are subjected to data partitioning, followed by application of different machine learning algorithms. The dataset is split into a training set (2/3 of the dataset) and a test set (1/3 of the dataset). In the training phase, different classifiers are applied to build an optimal model; the model is then validated on the test set in the testing phase. Once the dataset was segregated, the learning algorithms were applied to the training dataset: logistic regression, Bayesian classifier, random forest, neural network and support vector machine. The models generated by each classifier were assessed for their performance on the test dataset. A ten-fold cross-validation strategy was adopted to test accuracy: the test dataset is partitioned into ten subsamples and each subsample is used to test the performance of the model generated from the training dataset. This step is performed to minimize the probability of overfitting. The accuracy of each algorithm was estimated from the cross-validated outcomes.
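A hedged sketch of the 2/3 : 1/3 partition and the ten subsamples follows; the seed and sampling scheme are choices of this sketch, not taken from the paper.

```r
# Random 2/3 training : 1/3 test split of an assumed data frame `bank`.
set.seed(42)
n <- nrow(bank)
train_idx <- sample(seq_len(n), size = round(2 * n / 3))

train_set <- bank[train_idx, ]
test_set  <- bank[-train_idx, ]

# Ten subsamples of the held-out data, mirroring the validation step above:
folds <- split(sample(seq_len(nrow(test_set))),
               rep(1:10, length.out = nrow(test_set)))
```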
4.4. Developing a predictive decision support system
From the previous step, the classifier with the highest prediction accuracy was selected for developing a decision support system to predict bankruptcy. The prediction system was implemented in the RStudio interface to the R statistical programming environment. Several libraries were used in developing the predictive system, including 'gWidgets' and 'RGtk2'. The predictive tool builds a model that evaluates the bankruptcy class for user input data. The predicted class is compared with the actual class value from the dataset to compute the prediction error percentage of the system. The support system estimates the probability of bankruptcy
among customers. It can be used as an initial screening tool to strengthen the default estimate for a customer based on his or her practices.
The methodology of this study is illustrated in Figure 1.
Figure 1: The flowchart for developing a decision support system to predict bankruptcy
5. RESULTS AND DISCUSSION
5.1. Description of Qualitative Bankruptcy dataset
The bankruptcy dataset used for this study is available at the UCI Machine Learning Repository. The dataset, comprising six different features, is described in Table 1. The distribution of the class outcome is shown in Figure 2.
Table 1: Qualitative Bankruptcy Dataset (P = Positive, A = Average, N = Negative, NB = Non-Bankruptcy, B = Bankruptcy)

Sl. No  Attribute Name               Description of attribute
01      IR (Industrial Risk)         Nominal {P, A, N}
02      MR (Management Risk)         Nominal {P, A, N}
03      FF (Financial Flexibility)   Nominal {P, A, N}
04      CR (Credibility)             Nominal {P, A, N}
05      CO (Competitiveness)         Nominal {P, A, N}
06      OR (Operating Risk)          Nominal {P, A, N}
07      Class                        Nominal {NB, B}
Figure 2: The distribution of class output representing Non Bankruptcy and Bankruptcy
5.2. Correlation-based attribute selection
The bankruptcy dataset was subjected to feature selection to extract the relevant attributes. The nominal values in the dataset were first converted to numeric values: each descriptor was scaled as P = 1, A = 0.5 and N = 0, representing positive, average and negative values respectively. The procedure was repeated for all six attributes in the dataset. A Pearson correlation filter with a threshold value of 0.7 was then applied to the numerical dataset to remove redundant features. The analysis revealed that all six attributes were highly correlated with the outcome variable. To confirm the correlation results, a second feature selection method, an information gain ranking filter, was applied to test the importance of the features. This algorithm produced results similar to those of the correlation filter, so all six attributes of the dataset were retained for the study. The correlation plot for the features is shown in Figure 3.
Figure 3: The correlational plot illustrating the importance of each feature
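To mirror the confirmation step above, an information gain ranking can be computed with the FSelector package; this is one possible choice, as the paper does not name the library it used for this step.

```r
# Information gain ranking of the attributes against the class variable.
library(FSelector)

weights <- information.gain(Class ~ ., data = bank)
print(weights)  # higher scores indicate more informative attributes
```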
5.3. Machine learning algorithms
The features extracted in the previous step were passed to different machine learning algorithms in R. The algorithms were first applied to the training set to develop predictive models, and these models were then evaluated on the test set. The performance of each model was assessed using statistical instruments such as the confusion matrix and the receiver operating characteristic (ROC) curve. A confusion matrix is a contingency table that represents the performance of a machine learning algorithm [24]. It relates the actual class outcome to the predicted class outcome through the following four estimates:
a) True positive (TP): an actual positive class outcome is predicted as positive by the model
b) False positive (FP): an actual negative class outcome is predicted as positive; this is a Type I error
c) False negative (FN): an actual positive class outcome is predicted as negative; this is a Type II error
d) True negative (TN): an actual negative class outcome is predicted as negative by the model
Based on these four counts, the performance of the algorithms can be assessed by calculating the following ratios:
\text{Accuracy (\%)} = \frac{TP + TN}{TP + FP + TN + FN}

TPR = \frac{TP}{TP + FN}

FPR = \frac{FP}{FP + TN}

\text{Precision} = \frac{TP}{TP + FP}
The ROC curve is a plot of the false positive rate (X-axis) against the true positive rate (Y-axis); it represents the accuracy of a classifier [25].
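A sketch computing the four ratios above from a confusion matrix follows, assuming `pred` and `actual` are factors with levels c("NB", "B"); both names are placeholders.

```r
# Confusion matrix and the derived performance ratios.
cm <- table(Predicted = pred, Actual = actual)
TP <- cm["B", "B"];  TN <- cm["NB", "NB"]
FP <- cm["B", "NB"]; FN <- cm["NB", "B"]

accuracy  <- (TP + TN) / (TP + FP + TN + FN)
tpr       <- TP / (TP + FN)
fpr       <- FP / (FP + TN)
precision <- TP / (TP + FP)
```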
The accuracy of each model was computed and is reported in Table 2. The SVM classifier achieved better accuracy than the other machine learning algorithms, so the ROC plot of the RBF-based SVM classifier is shown in Figure 4.
Table 2: The accuracy of bankruptcy prediction of machine learning algorithms

Sl. No  Algorithm            Library used in R  Accuracy (%)  True positive rate  False positive rate  Precision
01      Logistic regression  glmnet             97.2          0.972               0.028                0.97
02      Random forest        randomForest       97.4          0.974               0.026                0.97
03      Naïve Bayes          e1071              98.3          0.983               0.017                0.98
04      Neural network       neuralnet          98.6          0.986               0.014                0.98
05      RBF-based SVM        e1071              99.6          0.996               0.004                0.99
Figure 4: The ROC curve representing the accuracy of SVM classifier
5.4. SVM-based decision support system in R
Based on the accuracies in the previous step, the support vector classifier outperformed the other machine learning techniques. The classifier was implemented with the radial basis function (RBF) kernel [26], also referred to as the Gaussian RBF kernel. The kernel creates a decision boundary for non-linear attributes in a high-dimensional space: the attributes are mapped by the kernel function into a space where they become linearly separable, and an optimal hyperplane is constructed in this feature space from the inner products given by the kernel. A hyperplane is considered optimal if it creates the widest gap between the input attributes and the target class. The parameters C and gamma are used to optimize the model. C controls the penalty for misclassification in the training dataset: a smaller C gives a soft margin with a wider separating band, whereas a larger C gives a hard margin and can lead to overfitting, so C must be selected by balancing the soft and hard margins. Gamma shapes the decision boundary of the non-linear classifier: a large gamma yields high variance and low bias, producing a sharply peaked boundary, while a small gamma yields low variance and high bias, producing a broader, smoother boundary. The values of C and gamma were tuned before the final classification. The instances classified by the RBF kernel are shown in Figure 5.
Figure 5: RBF classifier based classification for bankruptcy dataset as either NB or B
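A sketch of C/gamma selection with e1071's grid search is given below; the search ranges are illustrative assumptions, not the values used in the paper, and `train_set` is the assumed training partition from Section 4.3.

```r
# Grid search over C (cost) and gamma for the RBF kernel.
library(e1071)

tuned <- tune.svm(Class ~ ., data = train_set, kernel = "radial",
                  cost = 10^(-1:2), gamma = 10^(-2:1))
summary(tuned)

best_svm <- tuned$best.model  # model refit with the best-balancing C and gamma
```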
The prediction system was then constructed in R around the RBF classifier. The bankruptcy dataset is first loaded into the predictive system as a .csv file. The home page of the predictive tool, loaded with the bankruptcy dataset, is shown in Figure 6.
Figure 6: SVM based predictive tool with the bankruptcy dataset
The system reads the dataset and stores it as a data frame, a list of equal-length vectors that R uses to store data as a table. An RBF-kernel SVM model is then built for the bankruptcy dataset, as displayed in Figure 7.
Figure 7: Radial based SVM model developed for bankruptcy dataset
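A minimal sketch of this loading step follows: read the dataset into a data frame, recode the nominal values (P=1, A=0.5, N=0, per Section 5.2), and build the RBF model. The file name is a placeholder.

```r
# Load, recode, and fit the RBF-kernel SVM.
library(e1071)

bank <- read.csv("qualitative_bankruptcy.csv")  # placeholder file name
recode <- c(P = 1, A = 0.5, N = 0)
bank[, 1:6] <- lapply(bank[, 1:6], function(col) recode[as.character(col)])
bank$Class <- factor(bank$Class)

rbf_model <- svm(Class ~ ., data = bank, kernel = "radial")
```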
Once the model has been developed, users can enter their data in the text input boxes to predict bankruptcy. Each of the six input parameters takes the value 1, 0.5 or 0 (positive, average and negative respectively). Based on the SVM model built from the initial dataset, the predictive system classifies the input as either B (Bankruptcy) or NB (Non-Bankruptcy).
The predictive tool was tested for both the non-bankruptcy and bankruptcy conditions; the prediction results are shown in Figures 8 and 9 respectively.
Figure 8: The predicted result as NB (Non-Bankruptcy) for user input data based on RBF-kernel.
Figure 9: The predicted result as B (Bankruptcy) for user input data based on RBF-kernel.
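A sketch of scoring one user record with the 1 / 0.5 / 0 coding described above follows; the column names follow Table 1, and the record itself is invented for illustration.

```r
# Score a single user record and save the outcome, as the tool does.
new_case <- data.frame(IR = 1, MR = 0.5, FF = 0, CR = 1, CO = 0.5, OR = 0)
predicted <- predict(rbf_model, new_case)  # returns NB or B

write.csv(cbind(new_case, Predicted = predicted),
          "prediction_result.csv", row.names = FALSE)
```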
The performance of the tool was computed by comparing the predicted value with the actual bankruptcy value for the user data. The predictions were found to be on par with the actual
outcomes for both cases (NB and B). The predicted outcome is saved as a .csv file in the local directory of the user's system; the saved results are represented in Figure 10.
Figure 10: The predicted results for bankruptcy from the tool are saved
5.5. Developing the prediction system as a package
The predictive system for detecting bankruptcy was packaged in RStudio using the interdependent libraries 'devtools' and 'roxygen2'. Users can download the package to their local machines, then install and run it in RStudio. Once the package is installed, users can run the predictive system on their own input data to detect bankruptcy.
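One plausible way to scaffold, document and install such a package with devtools and roxygen2 is sketched below; the directory and package names are placeholders, not the paper's actual package.

```r
# Package build workflow with devtools (roxygen2 is invoked by document()).
library(devtools)

create("bankruptcyPredictor")    # scaffold the package skeleton
document("bankruptcyPredictor")  # generate roxygen2 documentation
install("bankruptcyPredictor")   # install into the local library

library(bankruptcyPredictor)     # users then load the package and run the tool
```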
6. CONCLUSIONS
The results suggest that machine learning techniques can be effectively applied to the prediction of bankruptcy. A prediction system was implemented to help financial organizations identify risk-prone customers; it predicts bankruptcy for a customer dataset based on the SVM model.
REFERENCES
[1] Personal Bankruptcy or Insolvency laws in India. (http://blog.ipleaders.in/personal-bankruptcy-or-insolvency-laws-in-india/). Retrieved November 2014.
[2] Hossein Rezaie Doolatabadi, Seyed Mohsen Hoseini, Rasoul Tahmasebi. “Using Decision Tree
Model and Logistic Regression to Predict Companies Financial Bankruptcy in Tehran Stock
Exchanges.” International Journal of Emerging Research in Management &Technology, 2(9): pp. 7-
16.
[3] M. Kantardzic (2003). “Data mining: Concepts, models, methods, and algorithms.” John Wiley &
Sons.
[4] X. Wu et al. (2008). "Top 10 algorithms in data mining". Knowl Inf Syst. 14: pp. 1-37.
[4] Hosmer David W, Lemeshow, Stanley (2000). “Applied Logistic Regression”. Wiley.
[5] Rish, Irina (2001). "An empirical study of the naive Bayes classifier". IJCAI Workshop on Empirical
Methods in AI.
[6] Breiman, Leo (2001). "Random Forests.” Machine Learning 45 (1):pp. 5–32.
[7] McCulloch, Warren; Walter Pitts (1943). "A Logical Calculus of Ideas Immanent in Nervous
Activity". Bulletin of Mathematical Biophysics 5 (4): pp. 115–133.
[8] Cortes, C.; Vapnik, V. (1995). "Support-vector networks". Machine Learning 20 (3): 273.
[9] Altman E (1968). Financial ratios, discriminant analysis and prediction of corporate bankruptcy. The
Journal of Finance. 23(4), 589-609.
[10] Martin, D (1977). Early warning of bank failures: A logit regression approach. Journal of Banking
and Finance. 1, 249-276.
[11] Lo, A (1984). Essays in financial and quantitative economics. Ph.D. dissertation, Harvard University.
[12] J. A. Ohlson (1980). “Financial ratios and the probabilistic prediction of bankruptcy,” Journal of
Accounting Research, 18(1), pp. 109-131.
[13] B. Back et al. (1996). "Choosing Bankruptcy Predictors using Discriminant Analysis, Logit Analysis and Genetic Algorithms," Turku School of Economics and Business Administration.
[14] J. E. Boritz, D. B. Kennedy (1995) “Effectiveness of neural network types for prediction of business
failure,” Expert Systems with Applications. 9(4): pp. 95-112.
[15] P. Coats, L. Fant (1992). “A neural network approach to forecasting financial distress,” Journal of
Business Forecasting, 10(4), pp. 9-12.
[16] M. D. Odom, R. Sharda (1990). "A neural network model for bankruptcy prediction," IJCNN International Joint Conference on Neural Networks, 2, pp. 163-168.
[17] J. Sun, H. Li (2008). "Data mining method for listed companies' financial distress prediction," Knowledge-Based Systems. 21(1): pp. 1-5.
[18] H. Ekşi (2011). "Classification of firm failure with classification and regression trees," International Research Journal of Finance and Economics. 76, pp. 113-120.
[19] V. Fan, V. Palaniswami (2000). "Selecting bankruptcy predictors using a support vector machine approach," Proceedings of the International Joint Conference on Neural Networks. pp. 354-359.
[20] S. Falahpour, R. Raie (2005). “Application of support vector machine to predict financial distress
using financial ratios”. Journal of Accounting and Auditing Studies. 53, pp. 17-34.
[21] Hyunchul Ahn, Kyoung-jae Kim (2009). “Bankruptcy prediction modeling with hybrid case-based
reasoning and genetic algorithms approach”. Applied Soft Computing. 9(2): pp. 599-607.
[22] Ming-Yuan Leon Li, Peter Miu (2010). “A hybrid bankruptcy prediction model with dynamic
loadings on accounting-ratio-based and market-based information: A binary quantile regression
approach”. Journal of Empirical Finance. 17(4): pp. 818-833.
[23] A. Martin, T Miranda Lakshmi, Prasanna Venkateshan (2014). “An Analysis on Qualitative
Bankruptcy Prediction rules using Ant-miner”. IJ Intelligent System and Applications. 01: pp. 36-44.
[24] Stehman, Stephen V. (1997). "Selecting and interpreting measures of thematic classification
accuracy". Remote Sensing of Environment. 62 (1): pp. 77–89.
[25] Fawcett, Tom (2006). "An Introduction to ROC Analysis". Pattern Recognition Letters. 27 (8): pp.
861 – 874.
[26] Vert, Jean-Philippe, Koji Tsuda, and Bernhard Schölkopf (2004). "A primer on kernel methods". Kernel Methods in Computational Biology.