Peer-to-peer lending companies provide online platforms that can quickly pair borrowers seeking a loan with investors willing to fund the loan at an attractive rate. Since these loans are unsecured and the companies creating the market generally do not invest their own capital, neither borrowers nor the companies assume any risk; the entire credit risk is borne by investors. The literature shows that credit risk depends on borrower characteristics, loan terms and regional macroeconomic factors. To help investors identify unsecured loans likely to be fully paid, a machine learning algorithm was developed to forecast the probability of full payment and the probability of default.
Training and input data consisted of historical loan data from Lending Club and state-level macroeconomic data from government and organizational sources. Logistic regression was shown to provide optimal results, effectively sequestering high-risk loans.
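Purely as an illustration of the kind of model the abstract describes, here is a minimal logistic regression sketch in R on simulated data; the feature names and the simulated outcome are hypothetical stand-ins, not the team's actual feature set or results.

# Hypothetical sketch: logistic regression for probability of full payment,
# fitted on simulated data standing in for Lending Club-style features.
set.seed(1)
n <- 1000
loans <- data.frame(
  fico   = round(rnorm(n, 700, 40)),   # borrower credit score (simulated)
  dti    = runif(n, 0, 35),            # debt-to-income ratio, % (simulated)
  amount = runif(n, 1000, 35000),      # requested amount, $ (simulated)
  term   = factor(sample(c("36 months", "60 months"), n, replace = TRUE))
)
# Simulated outcome: higher FICO raises the odds of full payment.
loans$fully_paid <- rbinom(n, 1, plogis(-8 + 0.013 * loans$fico - 0.02 * loans$dti))

fit <- glm(fully_paid ~ fico + dti + amount + term, data = loans, family = binomial)
summary(fit)
p_default <- 1 - predict(fit, type = "response")   # forecast probability of default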
Team Members:
Archange Giscard Destine
ad1373@georgetown.edu
linkedin.com/in/agdestine
Steven L. Lerner
sll93@georgetown.edu
linkedin.com/in/sllerner
Erblin Mehmetaj
em1109@georgetown.edu
www.linkedin.com/in/erblinmehmetaj
Hetal Shah
hrs41@georgetown.edu
linkedin.com/in/hetalshah
The Influence of Solvency Ratio Decision on Rural Bank Dinar Pusaka In The Di... (inventionjournals)
The solvency ratio is a ratio that can be used to inform lending decisions at the BPR. This research aims to test and find empirical evidence on whether the Debt to Assets Ratio, Times Interest Earned Ratio and Long-term Debt to Equity Ratio influence lending decisions. The population consists of customers applying for credit at BPR Dinar Pusaka in the Sidoarjo district. The sample was selected using purposive sampling, yielding 30 customers across three periods, namely the years 2013 to 2015. The data analysis technique used is logistic regression analysis. The results show that the Times Interest Earned Ratio does not affect lending decisions, while the Debt to Assets Ratio and Long-term Debt to Equity Ratio do influence lending decisions.
Project Storyboard: Reducing Underwriting Resubmits by Ov... (GoLeanSixSigma.com)
GoLeanSixSigma.com Black Belt Tyson Simmons' project to reduce underwriting package defects and subsequent re-submissions demonstrates some great points. His team voted to narrow down potential root causes and noted them with dots on their Fishbone Diagram. Then came the big "Oh darn!": when they tested the suspected root causes (analyst and submitter), neither proved to be statistically significant.
What do you do when all of your root causes prove to be false? You go back and look for more, which is what Tyson did. The red dots on the Fishbone Diagram suggested the next possible root cause, which did prove out. Nice job, Tyson, for sticking with the process and shooting right past your goal!
– Bill Eureka, GoLeanSixSigma.com Master Black Belt Coach
The aim of this study is to determine the factors that could affect credit scores and to reveal the relationship between the economic policies implemented in Turkey and the ratings assigned by credit rating agencies, using econometric methods and comparisons among countries. When a country's own resources are not enough to finance economic growth, it needs foreign investment, whether direct foreign investment or financial investment. Both kinds require trust in an economy before investing, so an indicator of how safe a country is to invest in is needed; the most important indicator developed for this purpose is the credit rating. Accordingly, figures for Turkey's GDP, current account balance, foreign borrowing and inflation over the years 2000-2015 are modelled using parametric and semiparametric logit models. Semiparametric methods combine the best features of the parametric and nonparametric approaches, using best-fitting smoothing methods when the parametric model's assumptions are violated. We used data from the IMF World Economic Outlook Database, IMF Article IV country reports, and the main published reports of Moody's, Standard & Poor's and Fitch.
Estimation of Net Interest Margin Determinants of the Deposit Banks in Turkey... (inventionjournals)
Banks, the irreplaceable intermediaries of the financial system, are financial institutions that contribute significantly to economic development. The basic criterion indicating the efficiency of banks' intermediation activities is the net interest margin. These intermediation costs are assumed to be high for developing countries such as Turkey. The degree to which banks are willing to lend out the funds they collect as credit to the system is directly related to how low their intermediation costs are. This paper aims to estimate the net interest margin determinants of deposit banks in Turkey. Three different panel data models are used for this purpose: fixed-effects and random-effects static models and a GMM (Generalized Method of Moments) dynamic model.
https://ijitce.com/index.php
Our journal maintains rigorous peer review standards. Each submitted article undergoes a thorough evaluation by experts in the respective field. This stringent review process helps ensure that only high-quality and scientifically sound research is accepted for publication. Researchers can trust that the articles they find in IJITCE have been critically assessed for validity, significance, and originality.
Reduction in customer complaints - Mortgage Industry (Pranov Mishra)
The project analyzes customer complaints/inquiries received by a US-based mortgage (loan) servicing company.
The goal of the project is to build a predictive model using the identified significant contributors and to come up with recommendations for changes that will lead to:
1. Reduced re-work
2. Reduced operational cost
3. Improved customer satisfaction
4. Improved company preparedness to respond to customers.
Three models were built: logistic regression, random forest and gradient boosting. Accuracy, AUC (area under the curve), sensitivity and specificity all improved markedly as model complexity increased from simple to complex.
Logistic regression did not generalize well to the non-linear data, so the model suffered from both bias and variance. Random forest is an ensemble technique in itself and helps greatly with reducing variance. Gradient boosting, with its sequential learning ability, helps reduce bias. The results from random forest and gradient boosting did not differ by much. This is consistent with the bias-variance trade-off concept, which implies that flexible, complex models will do well on non-linear data, whereas inflexible simple models will have high bias and can also have high variance.
Additionally, a lift chart was built, which gives a cumulative lift of 133% in the first four deciles.
Mortgage Banking: A Holistic Approach to Managing Compliance Risk (Cognizant)
With regulatory compliance requirements rapidly on the rise, we offer a full-spectrum approach to compliance risk management for mortgage banks, combining regulatory analysis, identification of competing regulations, operational process controls, and effective data quality and document management strategies.
Mortgage Insurance Data Organization - Havlicek & Mrotek (kylemrotek)
Presentation on the organization of mortgage insurance data for loss reserving purposes, presented at the Casualty Actuarial Society's 2008 RPM conference in Boston
The information you provided appears to be a list of column headers or variables related to a dataset containing information about loans or credit-related data. Here's a brief description of each column:
1. credit.policy: A binary variable indicating whether a customer meets the credit policy criteria (1 for yes, 0 for no).
2. purpose: The purpose for which the loan was taken (e.g., debt consolidation, credit card, small business).
3. int.rate: The interest rate of the loan.
4. installment: The monthly installment payment amount.
This presentation discusses the criteria an institution should use to evaluate its ALLL, recommendations and best practices to support a change and key areas examiners investigate after a significant change to the ALLL. See how automation can help: http://web.sageworks.com/alll/
Our goal with this commentary is to put common credit score myths to rest, and to shed some light on what the insights and research from Equifax prove to be true.
AI-based credit scoring - An Overview.pdf (StephenAmell4)
AI-based credit scoring is a contemporary method for evaluating a borrower’s creditworthiness. In contrast to the conventional approach that hinges on static variables and historical information, AI-based credit scoring harnesses the power of machine learning algorithms to scrutinize an extensive array of data from various sources.
Consumer Credit Scoring Using Logistic Regression and Random Forest (Hirak Sen Roy)
Project Details: In this study, the concept and application of credit scoring in a German banking environment are explained. A credit scoring model has been developed using logistic regression and random forest. Limitations of the model are explained, and possible solutions are given with an overview of LASSO.
Guide: Dr. Sibnarayan Guria, Associate Professor and Head of the Department, Department of
Statistics, West Bengal State University
Language Used: R
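Since the abstract names both methods and the language, here is a minimal sketch of that pipeline in R, using the Statlog German credit data shipped with the caret package as a stand-in for the study's actual dataset; the predictors chosen for the plain glm are an arbitrary illustration, not the study's specification.

# Sketch: logistic regression, random forest, and a LASSO variant
# on the German credit data (stand-in dataset, assumed comparable).
library(caret)          # provides data(GermanCredit)
library(randomForest)
library(glmnet)
data(GermanCredit)

# Plain logistic regression on a few illustrative predictors.
fit_glm <- glm(Class ~ Amount + Duration + Age,
               data = GermanCredit, family = binomial)
summary(fit_glm)

# Random forest over all predictors.
fit_rf <- randomForest(Class ~ ., data = GermanCredit)

# LASSO-regularised logistic regression; cv.glmnet picks the penalty.
x <- model.matrix(Class ~ ., GermanCredit)[, -1]
fit_lasso <- cv.glmnet(x, GermanCredit$Class, family = "binomial", alpha = 1)
coef(fit_lasso, s = "lambda.min")   # sparse coefficient vector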
"Growth Analytics: Evolution, Community and Tools" with emphasis on Google Analytics (and its API), including examples of how web analysts and data scientists can use this rich source of data for analysis and applications.
Customer analytics meetup in Dublin May '18
https://www.meetup.com/Customer-Analytics-Dublin-Meetup/events/250809233/
Covers key concepts of clickstream analysis and Markov Chains, followed by 3 practical applications with the R language (a toy transition-probability sketch follows the list):
- Frequent path analysis
- Future click prediction
- Transition probabilities mapping
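For instance, transition-probability mapping boils down to counting page-to-page moves and row-normalising the counts; a toy sketch in R, where the page names and click paths are invented for illustration:

# Toy clickstream paths (hypothetical data) mapped to a Markov transition matrix.
paths <- list(c("home", "sports", "ads"),
              c("home", "news", "news", "ads"),
              c("news", "sports"))
pages <- unique(unlist(paths))
trans <- matrix(0, length(pages), length(pages), dimnames = list(pages, pages))
for (p in paths)
  for (i in seq_len(length(p) - 1))
    trans[p[i], p[i + 1]] <- trans[p[i], p[i + 1]] + 1
trans / pmax(rowSums(trans), 1)   # row-normalised transition probabilities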
From niche bloggers up to multinational corporations, everyone is interested in monitoring their web traffic and its patterns across time.
Google Analytics is the most widely used solution for keeping track of this type of data. It provides a UI for a wide range of reports and possibilities for various types of visualizations.
Moreover, the availability of the Analytics API, coupled with the corresponding R packages, now gives more options for custom web analyses.
The plan for this talk is to cover the following :
• What is web analytics? How does it work?
• Interfacing with the Analytics Reporting API via an R package (RGA)
• Practical analytics applications with R
• Discussion
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup and is sponsored by Zilliz, maintainers of Milvus.
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Notes on graph algorithms like PageRank. Compressed Sparse Row (CSR) is an adjacency-list-based graph representation that is...
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Global Situational Awareness of A.I. and where it's headed (vikram sood)
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we're lucky, we'll be in an all-out race with the CCP; if we're unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) helps avoid duplicate computations and thus can also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes can be easily calculated. This can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
The Building Blocks of QuestDB, a Time Series Database (javier ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open-source time-series database designed for speed. We will also review a history of some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
1. Higher Diploma in Data Analytics
Programming for Big Data Project
Alexandros Papageorgiou
Student ID: 15019004
Analysis I:
The major factors determining the interest rate for a Lending Club loan request.
Analysis II:
Prediction of activity based on mobile phone spatial measurements
Analysis III:
Analysis of user interaction with online ads on a major news website with Spark
2. The major factors determining the interest rate for a Lending Club loan request.
Objectives of the analysis
Loans are commonplace nowadays, and advances in the finance industry have made requesting a loan a highly automated process. A major component of this process is the interest rate.
This is determined based on a number of factors drawn both from the applicant's credit history and from the application data submitted with the request, such as employment history, credit history and creditworthiness scores (lendingclub.com, 2015).
Determining the interest rate can be a complex task that requires advanced data analysis.
The purpose of this analysis is to identify the associations between interest rates and a number of other factors, based on the loan application data (such as employment history, credit history and creditworthiness scores) as well as data provided by external sources, in order to get a better understanding of how the interest rate is determined and to attempt to quantify these relationships.
In particular, this study investigates which factors beyond FICO (the main measure of the creditworthiness of the applicant) can have an impact. Using exploratory analysis and standard multiple regression techniques, it is demonstrated that there is a significant relationship between the interest rate and FICO, as well as two other variables (amount requested and length of the loan).
Dataset Description
For this analysis, a sample dataset of 2,500 observations (rows) and 14 variables (columns) from the Lending Club website was used, downloaded using the R programming language (R-Core-Team, 2015).
The Lending Club data used in this analysis contains observations under coded variable names, measuring the following:
Amount.Requested: The amount (in dollars) requested in the loan application
Amount.Funded.By.Investors: The amount (in dollars) loaned to the individual
Interest.Rate: The lending interest rate
Loan.Length: The length of time (in months) of the loan
Loan.Purpose: The purpose of the loan as stated by the applicant
Debt.To.Income.Ratio: The percentage of the consumer's gross income that goes towards paying debts
State: The abbreviation for the U.S. state of residence of the loan applicant
Home.Ownership: A variable indicating whether the applicant owns, rents, or has a mortgage on their home
Monthly.Income: The monthly income of the applicant (in dollars)
FICO.Range: A range indicating the applicant's FICO score, a measure of the applicant's creditworthiness
Open.Credit.Lines: The number of open lines of credit the applicant had at the time of application
Revolving.Credit.Balance: The total amount outstanding across all lines of credit
Inquiries.in.the.Last.6.Months: The number of authorized credit inquiries in the 6 months before the loan was issued
Employment.Length: Length of time employed at current job
3. Challenge: Data not in a tidy form
Exploratory analysis was the method used, constructing plots and relevant tables to examine the quality of the data provided and to explore possible associations between the interest rate and the independent variables. This was done after handling the 7 missing values found, making sure that the analysis was performed on complete cases, on the assumption that so few missing values would have no significant effect on the analysis.
Other data type transformations:
Several factor or character variables converted to numerical
Removal of the % symbol from the interest rate
FICO range converted from a range into a single figure
Renaming of variables where appropriate
Rationale for the transformations:
These transformations were made in order to enable more flexible handling of the data in R, especially by converting variables to numerical form.
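As a sketch of these steps, assuming the raw file is loansData.csv and the raw columns match the list above (the file name and exact raw formats are assumptions):

# Hypothetical sketch of the cleaning steps described above.
loans <- read.csv("loansData.csv", stringsAsFactors = FALSE)

# Keep complete cases only (7 missing values were found).
loans <- loans[complete.cases(loans), ]

# Remove the % symbol from the interest rate and convert to numeric.
loans$Interest.Rate <- as.numeric(sub("%", "", loans$Interest.Rate))

# Collapse the FICO range (e.g. "720-724") into a single numeric figure,
# here the lower bound of the range.
loans$FICO.Score <- as.numeric(sub("-.*", "", loans$FICO.Range))

# Treat loan length as a factor with "36 months" as the reference level.
loans$Loan.Length <- relevel(factor(loans$Loan.Length), ref = "36 months")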
To relate the interest rate to its major components, a standard linear regression model was deployed. The model selection was performed on the basis of the exploratory analysis and prior knowledge of the relationship between the interest rate and the factors considered critical to its determination.
Data processing activities
As noted above, a minimal number of missing values were identified and, where appropriate, removed. Beyond that, the data was found to be within normal and acceptable ranges, without extreme values in the interest rate or in the other independent variables.
The final dataset was in line with the tidy data rules (Wickham, 2015).
As a first step in the exploratory analysis, a correlation analysis of all the numeric variables was conducted in order to identify possible associations among them, and particularly the variables that correlate well with the interest rate.
The results of this first analysis reveal quite a high negative correlation between the interest rate and the FICO score (r = -0.7); there is also some correlation with amount requested and amount funded, at the level of r = 0.33. The correlation of the other variables with the interest rate was relatively low.
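Such a screen is a one-liner in R; a sketch, continuing from the cleaned data frame above:

# Correlation of every numeric variable with the interest rate (sketch).
num <- loans[sapply(loans, is.numeric)]
sort(cor(num)[, "Interest.Rate"], decreasing = TRUE)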
To carry the analysis forward, the information provided on the club's website was considered; it mentions the following credit risk indicators as factored into the model for the interest rate:
Requested loan amount
Loan maturity (36 or 60 months)
Debt to income ratio
Length of credit history
Number of other accounts opened
Payment history
Number of other credit inquiries initiated over the past six months.
4. The FICO score is also quite explicitly mentioned as a decisive factor, so in this context it will unavoidably be one of the variables that define the model.
A number of experiments with box plots were performed in order to identify possible relationships between the interest rate and the categorical variables. It turned out that what seems to have an impact on the interest rate is the length of the loan.
Obviously, overloading the model by including all the variables is not the optimal strategy (Cohen, 2009), and therefore a selection has to be made based on the results of the correlation analysis, the box plots for the categorical variables, and the information provided on the website.
After testing a number of models, the one found to fit this analysis best is the following:
Interest Rate= b0 + b1(FICO Score) + b2 (Requested Amount) + b3(Length of the Loan) + e
where b0 is an intercept term, and b1 represents the (negative) change in the interest rate for a one-unit change in FICO score; similarly, b2 represents the impact on the interest rate of a one-dollar increase in the requested amount. The length-of-loan term is a two-level categorical variable that represents the change in the interest rate when moving from a 36-month to a 60-month loan period, at average levels of the other two independent variables.
The error term e represents all sources of unmeasured and un-modelled random variation (Stockburger, 2015).
For the length of the loan, a set of dummy variables was implemented so that the R function can interpret the data more effectively; "36 months" was selected as the reference level.
In the case of the amount, due to confounding concerns (the two amount variables correlate not just with the interest rate but, most obviously, between themselves as well), only the amount requested was included in the regression model, since the funded amount directly depends on the originally requested amount.
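A sketch of the fit as specified, using the variable names from the cleaned dataset (the original script's exact names are an assumption):

# Fit the selected model: interest rate on FICO, requested amount, loan length.
fit <- lm(Interest.Rate ~ FICO.Score + Amount.Requested + Loan.Length, data = loans)
summary(fit)   # coefficients and p-values
confint(fit)   # 95% confidence intervals, as reported below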
We observed a highly statistically significant (p < 2e-16) association between the interest rate and the FICO score: a one-unit change in FICO corresponded to a change of b1 = -8.9e-02 in the interest rate (95% confidence interval: -8.98e-02 to -8.51e-02).
There was also an association between the interest rate and the amount requested (p < 2e-16): a one-unit change in the amount requested corresponded to a change of b2 = 1.446e-04 in the interest rate (95% confidence interval: 1.32e-04 to 1.57e-04).
Last, the intercept of 7.245e+01 is the interest rate predicted when all the coefficients are set to zero, which corresponds to the projected value for the 36-month loan period; when the loan-length dummy takes the value of one, the prediction corresponds to the 60-month length (p < 2e-16).
The model has an adjusted R-squared of 0.7454, which is the share of the variation that is explained by the model.
An analysis of residuals, limited in scope, comparing the effectiveness of the multiple regression against the simple linear regression model, shows that the non-random residual variance is better fitted by the multiple regression.
5. Conclusions:
The analysis suggests that there are significant associations between the interest rate and the FICO score (negative), as well as factors such as loan length and amount requested (positive). The analysis estimated these relationships using a linear model.
This analysis provides some insights into the way a lending institution like Lending Club determines the cost of money for its customers. It therefore makes sense for borrowers to be aware of the major factors that determine the interest rate they will be asked to pay and, based on this knowledge, possibly take action that could contribute to more favourable terms (for example, ask for a lower amount and repay the loan sooner rather than later).
It is important to keep in mind that this study is the result of a limited dataset from just one institution and may therefore be subject to bias. As time goes on, and depending also on other parameters of the national and international economy, other factors might come to play critical roles too. In any case, an informed customer who is aware of this type of analysis is likely to make better decisions in his or her loan purchase.
Works Cited
Cohen, Y., 2009. Statistics and Data with R. s.l.:Wiley.
lendingclub.com, 2015. Interest rates and how we set them. [Online]
Available at: https://www.lendingclub.com/public/how-we-set-interest-rates.action
[Accessed 25 11 2015].
R-Core-Team, 2015. R: A language and environment for statistical computing. [Online]
Available at: http://www.R-project.org
Stockburger, D. W., 2015. Multiple Regression with Categorical Variables. [Online]
Available at: http://www.psychstat.missouristate.edu/multibook/mlt08m.html
Wickham, H., 2015. Tidy Data. [Online]
Available at: http://vita.had.co.nz/papers/tidy-data.pdf
6. Title: Prediction of activity based on mobile phone spatial measurements
Introduction and Objectives:
Advances in mobile phone technology and the proliferation of smart devices have enabled the collection of spatial data from smartphone users, with the intention of studying the relationship between the measurements registered by the devices and the corresponding synchronous activity of the subjects.
A data analysis methodology will be used, with a prediction model, to determine user activity based on a wide range of signals related to body motion.
In particular, the analysis is based on the records of the Activity Recognition database, which was built from the recordings of 30 subjects performing Activities of Daily Living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors, including accelerometers and gyroscopes. The objective is the recognition of six different human activities based on the quantitative measurements of the Samsung phones.
Data Description
A group of 30 volunteers was selected for this task by the original research team. Each person was instructed to follow a predefined set of activities while wearing a waist-mounted smartphone. The six selected ADLs were standing, sitting, lying down, walking, walking downstairs and walking upstairs.
The respective data set was downloaded from the URL https://sparkpublic.s3.amazonaws.com/dataanalysis/samsungData.rda. The data was partly preprocessed by the authors of the Coursera data analysis course (https://github.com/jtleek/dataanalysis) to facilitate its use within the R environment.
The data consists of 7352 entities (rows), each of which corresponds to a time-indexed activity of one of the 21 subjects, and 563 variables (columns) corresponding to measurements from the two sensors.
Specifically, for each record the data provides:
- Acceleration from the accelerometer (total acceleration) and the estimated body acceleration (X, Y, Z axes)
- Angular velocity from the gyroscope (X, Y, Z axes)
- Various descriptive statistics based on the above measurements
Also, 2 additional pieces of information are included as variables:
- The corresponding activity
- The subject identifier
Data processing
The different activities were relatively evenly distributed, and the same applies to the observations for each subject, so no extremes were found in this context.
7. All the columns referring to measurements are numeric. The subject is an integer, and the activity is of character type; it is transformed to a factor to assist with R processing, given that activity is the dependent variable of this dataset.
Prior to the analysis, a number of additional data transformations needed to take place. There were some issues; for example, a number of variables appear to have the same names but different values. Specifically, the bandsEnergy-related variables are repeated in sets of 3: columns 303-316, 317-330 and 331-344 have the same column names.
To fix this, the variables were renamed so as to avoid duplication and possible problems in the analysis of the data. These transformations were made in order to enable more flexible handling of the data in R.
Moreover, the variable names were cleaned by removing punctuation such as "()" and "-" characters, to make them syntactically valid.
An unusual fact observed is that all the numeric data appear to fall within a range of -1 to 1, but this turns out to be because the data is normalized. There were no missing values, only complete cases, and apart from the changes mentioned above no other data type transformations were found to be necessary.
The dataset was split into two sets for training and testing. On a random split, the training set includes the subject ids 1, 3, 5 and 6, a total of 4 subjects corresponding to 328 observations. The test set includes ids 27, 28, 29 and 30, a total of 4 subjects corresponding to 371 observations.
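A sketch of this split, assuming samsungData.rda loads a samsungData data frame as described above:

# Load the preprocessed Samsung data and split train/test by subject id.
load("samsungData.rda")                    # provides samsungData
samsungData$activity <- factor(samsungData$activity)
train <- samsungData[samsungData$subject %in% c(1, 3, 5, 6), ]
test  <- samsungData[samsungData$subject %in% c(27, 28, 29, 30), ]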
Results:
The selected method for the analysis is classification trees. Trees are particularly useful when there are many explanatory variables. If the "twigs" of the tree are categorical, a classification tree is recommended in order to partition the data into groups that are as homogeneous as possible (Pekelis, 2013).
The next call to make is the selection of the variables to integrate into the model. It was decided that, with the very high number of columns in the dataset, it might not be meaningful to examine each one individually. The first exploratory attempt was to fit a tree on the first variable set, related to body acceleration (tBodyAcc), which includes the first 15 variables of the data set. A classification tree was grown on the training set, and the predictive model was tested on the test data. The misclassification rate was as high as 41%, which led to the straightforward abandonment of this first set of variables.
As a next step, and considering again the nature of the variables, a more comprehensive approach was chosen: include all the variables (with the obvious exception of the subject id) and let the classification tree algorithm choose the critical nodes.
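A sketch of this comprehensive fit using the tree package (the report does not name the package used, so this choice is an assumption):

# Grow a classification tree on all predictors except the subject id.
library(tree)
fit <- tree(activity ~ . - subject, data = train)
summary(fit)   # variables actually used as nodes, training misclassification

# Misclassification rate on the held-out test subjects.
pred <- predict(fit, newdata = test, type = "class")
mean(pred != test$activity)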
There were 11 variables selected as nodes namely:
"tBodyAcc.std.X" "tGravityAcc.mean.X" "tGravityAcc.max...Y"
"tGravityAcc.mean.Y" "fBodyGyro.meanFreq.X" "tGravityAcc.arCoeff.X.1"
"tBodyAccJerk.max.X" "tBodyGyroMag.arCoeff..1" "fBodyAcc.bandsEnergy.1.8"
"tBodyAcc.arCoeff.X.3"
8. These variables correspond to various summary-statistic measurements from the two sensors. The misclassification error rate was ~3% on the training set, failing to classify the activity correctly just nine times out of 328. However, the value of the model is not proven unless it is applied to the test set. Against the test set the prediction error reaches 19.6%, so the model successfully predicts over 80% of the cases.
The most significant nodes at the top of the tree are the body acceleration standard deviation on X, the gravity acceleration mean on X, and the gravity acceleration coefficient on X.
To check whether the tree has any potential to improve its performance, a cross-validation experiment was made to find the deviance and the number of misclassifications as a function of the size of the model. From the resulting graph, it appears that the model has the least misclassification and deviance at a size of 8 nodes. Given this result, the next step is to prune the tree by predefining the number of nodes to 8.
Fitting the 8-node model to the test data produces a 20.2% error rate, marginally higher than that of the previous model.
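The cross-validation and pruning step, sketched with the same assumed package:

# Cross-validate tree size against misclassifications, then prune to 8 nodes.
cv <- cv.tree(fit, FUN = prune.misclass)
plot(cv$size, cv$dev, type = "b")   # misclassifications versus tree size
pruned <- prune.misclass(fit, best = 8)
mean(predict(pruned, test, type = "class") != test$activity)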
Conclusions:
Based on the results of the data analysis, it turns out that the classification tree method was effective in handling a large number of variables which, due to their number and nature, could not have been analysed one by one. That said, there is definitely room for improvement, especially if an analysis can look deeper into the meaning of the variables and identify patterns and relationships between them that could guide the selection of variables for the model, instead of opting for the comprehensive approach, as was the case in this analysis.
In the relatively straightforward approach adopted, no potential confounders were identified, as this model did not include any linear analysis. The main criterion for judging the model was accuracy, but additional ways of measuring error can be considered. It is also possible to use techniques that are likely to improve on the initial classification tree results, such as random forests (Chen, 2009).
Works Cited
Chen, F., 2009. R Examples of Using Some Prediction Tools. [Online]
Available at: stat.fsu.edu/~fchen/prediction-tool.pdf
[Accessed 05 12 2015].
Pekelis, L., 2013. Classification And Regression Trees : A Practical Guide for Describing a Dataset.
[Online]
Available at: http://statweb.stanford.edu/~lpekelis/talks/13_datafest_cart_talk.pdf
[Accessed 04 12 2015].
Theodoridis, Y., 1996. A model for the prediction of R-tree performance. ACM Digital Library. [Online]
Available at: dl.acm.org/citation.cfm?id=237705
[Accessed 07 12 2015].
9. Title: Analysis of user interaction with online ads on a major news website
The dataset
The dataset is part of a sequence of files that contain daily click-through data for online ads, based on user characteristics as recorded on the New York Times website in May 2012. The datasets are available in the "Doing Data Science" book's (Schutt, 2013) GitHub repo; in this analysis only the first day of available data is considered. It contains over 458,000 observations including 5 variables:
Age of user - numerical variable
Gender - binary variable
Signed_In - binary variable representing whether the user was logged in or not
Impressions - the number of ad impressions during the session
Clicks - the number of click-throughs to one or more ads on the website
Every row corresponds to a user. It is, generally speaking, a simple low-dimensional dataset, which can nonetheless be used to conduct a basic analysis of user behaviour on the website in relation to interaction with the ad content on the site.
Configurations: Setting up the Environment
The platform used for this analysis was IBM Bluemix, which, via its integrated notebook interface, provides access to Apache Spark, an open-source, in-memory data processing engine for cluster computing that shares some common ground with the Hadoop MapReduce programming framework.
The dataset, in CSV format, is first uploaded as a new data source on Bluemix using the Object Storage service, which is associated with the required credentials.
The first step is to define a function that sets the Hadoop configuration, taking the credentials as a parameter. Next, using the insert-code function of the data sources panel, a dictionary with the credentials associated with the data source is created and then passed as an argument to the set-Hadoop-configuration function in order to activate the service.
For the data processing activities that follow, PySpark, the Python API to Spark, is used. The data structure used throughout the analysis is the Spark DataFrame.
Algorithms, results, challenges:
The algorithms used for the analysis are, for the most part, of the 'split-apply-combine' category, whereby the data are grouped based on an attribute, and a function is then applied to the grouped data, summarising the values within each group into one value. This is in line with the MapReduce principles: create key-value pairs, group by key (with the individual values of same-key entries collected into a sequence associated with the key), and then reduce this to one aggregate value that represents all the values under the common key.
10. Although in Spark the map and reduce process differs from the Hadoop MapReduce implementation (Owen, 2014), specialised Spark functions such as reduceByKey, groupByKey and flatMap provide equivalent functionality.
In Spark the above procedure represents a transformation, which is lazily evaluated when an action is performed, i.e. when an answer from the system is explicitly requested.
For example, the observations are grouped by gender, and then a function is applied to the groups that outputs the mean age by gender (22.9 for males and 40.8 for females).
Other algorithms are used to transform existing variables into new ones; for example, the number of clicks and the number of impressions are combined as a ratio to produce the click-through rate.
In other cases, SQL-type analysis is deployed, for example to filter the observations, keeping only a subset that belongs to, say, the 25-35 age group, and then focusing the analysis on that specific segment.
There are also interesting implementations of summary statistics, including the count of observations (458,441) and the mean values of the key variables, for example the average age (29.4 years), the average number of impressions (5) and the average number of clicks (just below 0.1).
We can also see that the minimum reported age is 0 and the maximum 99, the maximum number of ad impressions is 9, and the maximum number of clicks is 4.
The Pearson correlation between impressions and clicks is 0.13, which is positive but relatively weak, implying that more ad impressions do not always lead to more clicks.
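The original notebook uses PySpark; purely to illustrate the same split-apply-combine operations in this document's main language, here is a sketch with R's sparklyr interface to Spark (the local connection and the file name nyt1.csv are assumptions):

# Sketch via sparklyr; the original analysis used PySpark DataFrames.
library(sparklyr)
library(dplyr)
sc  <- spark_connect(master = "local")
nyt <- spark_read_csv(sc, name = "nyt", path = "nyt1.csv")   # assumed file name

# Split-apply-combine: mean age by gender (lazy until the result is requested).
nyt %>% group_by(Gender) %>% summarise(mean_age = mean(Age, na.rm = TRUE))

# Derived click-through rate, filtered to the 25-35 age segment,
# plus the impressions-clicks correlation.
nyt %>%
  filter(Impressions > 0, Age >= 25, Age <= 35) %>%
  mutate(ctr = Clicks / Impressions) %>%
  summarise(mean_ctr = mean(ctr, na.rm = TRUE),
            corr = cor(Impressions, Clicks))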
The algorithms used, along with the full results, are presented in detail in the attached Jupyter notebook.
One of the challenges of working with large-scale data is the need to use distributed frameworks for the computation, which translates into the need to use suitable data structures. In Spark the core data structure is the RDD (Resilient Distributed Dataset), essentially a collection of elements that can be partitioned across the nodes of a cluster.
Data manipulation with RDDs is not as intuitive and expressive for data analysis as with other data structures. Given the introduction of Spark DataFrames, that was the data structure selected for the analysis: thanks to the named-columns attribute it supports, the data analysis tasks performed are more intuitive, as well as more efficient in terms of computational speed, compared with RDDs.
Works Cited
Anon., 2015. Spark programming-guide. [Online]
Available at: spark.apache.org/docs/latest/programming-guide.html
Owen, S., 2014. How to translate from MapReduce to Apache Spark. [Online]
Available at: https://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
Schutt, R., 2013. Doing Data Science. [Online]
Available at: https://github.com/oreillymedia/doing_data_science