df <- fread("../DataSet/cleaned_accepted_2013_to_2018Q4.csv")
col_names <- c("Debt-To-Income Ratio", "Risk_Score", "Application Date", "Employment Length", "Amount Requested")
dfr <- fread("../DataSet/filtered_rejected_2013_to_2018Q.csv", select = col_names)
#dfr <- fetch_subset(df_location = "../DataSet/filtered_rejected_2013_to_2018Q.csv", col_names = col_names, nrows = c(1,-1))
set.seed(1) # For reproducibility
dfr <- dfr %>% sample_frac(0.20)
Founded in 2006, LendingClub is a U.S. financial services company that facilitates loan contracts between borrowers and investors. Our research focuses on LendingClub’s loan data, which consists of approved and rejected loan applications from 2013 to 2018. To understand LendingClub’s clientele, we analyze borrower and loan characteristics to assess risk levels, determine loan acceptance, and understand which factors influence whether accepted loans are fully repaid. To accomplish this, we ask the following SMART questions:
How do the loan features of term, amount, and interest rate individually influence the likelihood of a loan being charged off (lender has accepted the loan as unlikely to recover) versus being fully paid?
How do variables such as dti, annual_inc, int_rate, and fico_range impact the likelihood of a loan being “fully paid” or “charged off” year over year from 2013 to 2018? What trends can be identified in the data, and how do these variables differ between the two loan statuses over time?
Do the loan grades provided to each customer correlate to their loan repayment behavior based on income and loan status?
Is there a relationship between employment length and the amount requested for rejected loans and how does it compare to the customers that had accepted loans?
Funded Amount: The total amount of the loan to be paid off.
Loan Status: Whether the loan has been fully paid or has been charged off. A loan that is fully paid is defined as a loan where the total payment has been met. A charge off status is defined as the lender accepting the loan as uncollectible. This differs from the default status in which payments are late but the lender expects to collect what the borrower still owes.
Term: How long the borrower has to pay off the loan (either 36 or 60 months).
Interest Rate: The percentage of the loan amount that the lender charges the borrower for borrowing money.
DTI (Debt to Income Ratio): A measure used to evaluate a borrower’s financial situation by comparing their debt payments to their gross monthly income.
Annual Income: The dollar amount the customer earns on a yearly basis.
FICO Range: A measure of creditworthiness used by lenders to assess a borrower’s ability to repay a loan.
Loan Grades: Classifies the risk level of a loan from A (least risky) to G (most risky).
Employment Length: The number of years the customer has been employed.
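The DTI definition above can be illustrated with a short sketch; the figures here are hypothetical, not taken from the dataset:

```r
# Hypothetical borrower: $1,500/month in debt payments, $60,000/year gross income
monthly_debt   <- 1500
monthly_income <- 60000 / 12                 # gross monthly income
dti <- monthly_debt / monthly_income * 100   # debt-to-income ratio, as a percent
dti                                          # -> 30
```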
The datasets used in this research come from the Lending Club loan data available on Kaggle and contain information about both accepted and rejected loan applications from 2013 to 2018.
print("Accepted Dataset:")
[1] "Accepted Dataset:"
str(df) # Displaying structure of accepted dataset
Classes 'data.table' and 'data.frame': 2157894 obs. of 110 variables:
$ id : int 68407277 68355089 68341763 66310712 68476807 68426831 68476668 67275481 68466926 68616873 ...
$ loan_amnt : num 3600 24700 20000 35000 10400 ...
$ funded_amnt : num 3600 24700 20000 35000 10400 ...
$ funded_amnt_inv : num 3600 24700 20000 35000 10400 ...
$ term : chr "36 months" "36 months" "60 months" "60 months" ...
$ int_rate : num 14 12 10.8 14.8 22.4 ...
$ installment : num 123 820 433 830 290 ...
$ grade : chr "C" "C" "B" "C" ...
$ sub_grade : chr "C4" "C1" "B4" "C5" ...
$ emp_title : chr "leadman" "Engineer" "truck driver" "Information Systems Officer" ...
$ emp_length : chr "10+ years" "10+ years" "10+ years" "10+ years" ...
$ home_ownership : chr "MORTGAGE" "MORTGAGE" "MORTGAGE" "MORTGAGE" ...
$ annual_inc : num 55000 65000 63000 110000 104433 ...
$ verification_status : chr "Not Verified" "Not Verified" "Not Verified" "Source Verified" ...
$ issue_d : IDate, format: "2015-12-01" "2015-12-01" ...
$ loan_status : chr "Fully Paid" "Fully Paid" "Fully Paid" "Current" ...
$ pymnt_plan : chr "n" "n" "n" "n" ...
$ url : chr "https://lendingclub.com/browse/loanDetail.action?loan_id=68407277" "https://lendingclub.com/browse/loanDetail.action?loan_id=68355089" "https://lendingclub.com/browse/loanDetail.action?loan_id=68341763" "https://lendingclub.com/browse/loanDetail.action?loan_id=66310712" ...
$ purpose : chr "debt_consolidation" "small_business" "home_improvement" "debt_consolidation" ...
$ title : chr "Debt consolidation" "Business" "" "Debt consolidation" ...
$ zip_code : chr "190xx" "577xx" "605xx" "076xx" ...
$ addr_state : chr "PA" "SD" "IL" "NJ" ...
$ dti : num 5.91 16.06 10.78 17.06 25.37 ...
$ delinq_2yrs : num 0 1 0 0 1 0 0 1 0 0 ...
$ earliest_cr_line : chr "Aug-2003" "Dec-1999" "Aug-2000" "Sep-2008" ...
$ fico_range_low : num 675 715 695 785 695 690 680 705 685 700 ...
$ fico_range_high : num 679 719 699 789 699 694 684 709 689 704 ...
$ inq_last_6mths : num 1 4 0 0 3 0 0 0 1 0 ...
$ mths_since_last_delinq : num 30 6 NA NA 12 NA 49 3 NA 75 ...
$ open_acc : num 7 22 6 13 12 5 12 8 14 8 ...
$ pub_rec : num 0 0 0 0 0 0 0 0 1 0 ...
$ revol_bal : num 2765 21470 7869 7802 21929 ...
$ revol_util : num 29.7 19.2 56.2 11.6 64.5 68.4 84.5 5.7 34.5 39.1 ...
$ total_acc : num 13 38 18 17 35 6 27 15 23 18 ...
$ initial_list_status : chr "w" "w" "w" "w" ...
$ out_prncp : num 0 0 0 15898 0 ...
$ out_prncp_inv : num 0 0 0 15898 0 ...
$ total_pymnt : num 4422 25680 22706 31464 11740 ...
$ total_pymnt_inv : num 4422 25680 22706 31464 11740 ...
$ total_rec_prncp : num 3600 24700 20000 19102 10400 ...
$ total_rec_int : num 822 980 2706 12362 1340 ...
$ total_rec_late_fee : num 0 0 0 0 0 0 0 0 0 0 ...
$ recoveries : num 0 0 0 0 0 0 0 0 0 0 ...
$ collection_recovery_fee : num 0 0 0 0 0 0 0 0 0 0 ...
$ last_pymnt_d : chr "Jan-2019" "Jun-2016" "Jun-2017" "Feb-2019" ...
$ last_pymnt_amnt : num 123 926 15813 830 10129 ...
$ next_pymnt_d : chr "" "" "" "Apr-2019" ...
$ last_credit_pull_d : chr "Mar-2019" "Mar-2019" "Mar-2019" "Mar-2019" ...
$ last_fico_range_high : num 564 699 704 679 704 759 654 674 719 679 ...
$ last_fico_range_low : num 560 695 700 675 700 755 650 670 715 675 ...
$ collections_12_mths_ex_med : num 0 0 0 0 0 0 0 0 0 0 ...
$ policy_code : num 1 1 1 1 1 1 1 1 1 1 ...
$ application_type : chr "Individual" "Individual" "Joint App" "Individual" ...
$ acc_now_delinq : num 0 0 0 0 0 0 0 0 0 0 ...
$ tot_coll_amt : num 722 0 0 0 0 ...
$ tot_cur_bal : num 144904 204396 189699 301500 331730 ...
$ open_acc_6m : num 2 1 0 1 1 0 0 0 2 0 ...
$ open_act_il : num 2 1 1 1 3 1 2 3 1 2 ...
$ open_il_12m : num 0 0 0 0 0 0 0 0 0 2 ...
$ open_il_24m : num 1 1 4 1 3 0 2 4 0 3 ...
$ mths_since_rcnt_il : num 21 19 19 23 14 338 18 13 35 10 ...
$ total_bal_il : num 4981 18005 10827 12609 73839 ...
$ il_util : num 36 73 73 70 84 99 63 75 57 72 ...
$ open_rv_12m : num 3 2 0 1 4 0 2 0 2 0 ...
$ open_rv_24m : num 3 3 2 1 7 0 3 0 7 2 ...
$ max_bal_bc : num 722 6472 2081 6987 9702 ...
$ all_util : num 34 29 65 45 78 76 74 55 46 49 ...
$ total_rev_hi_lim : num 9300 111800 14000 67300 34000 ...
$ inq_fi : num 3 0 2 0 2 0 1 1 2 0 ...
$ total_cu_tl : num 1 0 5 1 1 0 0 0 0 0 ...
$ inq_last_12m : num 4 6 1 0 3 0 1 2 1 1 ...
$ acc_open_past_24mths : num 4 4 6 2 10 0 6 4 7 5 ...
$ avg_cur_bal : num 20701 9733 31617 23192 27644 ...
$ bc_open_to_buy : num 1506 57830 2737 54962 4567 ...
$ bc_util : num 37.2 27.1 55.9 12.1 77.5 ...
$ chargeoff_within_12_mths : num 0 0 0 0 0 0 0 0 0 0 ...
$ delinq_amnt : num 0 0 0 0 0 0 0 0 0 0 ...
$ mo_sin_old_il_acct : num 148 113 125 36 128 338 142 149 164 155 ...
$ mo_sin_old_rev_tl_op : num 128 192 184 87 210 54 306 55 129 253 ...
$ mo_sin_rcnt_rev_tl_op : num 3 2 14 2 4 32 10 32 1 15 ...
$ mo_sin_rcnt_tl : num 3 2 14 2 4 32 10 13 1 10 ...
$ mort_acc : num 1 4 5 1 6 0 4 3 1 1 ...
$ mths_since_recent_bc : num 4 2 101 2 4 36 12 32 4 50 ...
$ mths_since_recent_inq : num 4 0 10 NA 1 NA 10 8 1 10 ...
$ mths_since_recent_revol_delinq: num 69 6 NA NA 12 NA NA NA NA NA ...
$ num_accts_ever_120_pd : num 2 0 0 0 0 0 0 1 0 1 ...
$ num_actv_bc_tl : num 2 5 2 4 4 2 4 2 6 3 ...
$ num_actv_rev_tl : num 4 5 3 5 6 3 6 2 9 3 ...
$ num_bc_sats : num 2 13 2 8 5 2 4 3 7 3 ...
$ num_bc_tl : num 5 17 4 10 9 2 5 3 10 6 ...
$ num_il_tl : num 3 6 6 2 10 2 7 9 3 5 ...
$ num_op_rev_tl : num 4 20 4 10 7 4 9 3 13 5 ...
$ num_rev_accts : num 9 27 7 13 19 4 16 3 19 11 ...
$ num_rev_tl_bal_gt_0 : num 4 5 3 5 6 3 6 2 9 3 ...
$ num_sats : num 7 22 6 13 12 5 12 8 14 8 ...
$ num_tl_120dpd_2m : num 0 0 0 0 0 0 0 0 0 0 ...
$ num_tl_30dpd : num 0 0 0 0 0 0 0 0 0 0 ...
$ num_tl_90g_dpd_24m : num 0 0 0 0 0 0 0 1 0 0 ...
$ num_tl_op_past_12m : num 3 2 0 1 4 0 2 0 2 2 ...
[list output truncated]
- attr(*, ".internal.selfref")=<externalptr>
The Rejected Dataset comprises loan applications that were not approved by Lending Club. This dataset includes important information on why certain applicants were denied, based on factors like their debt-to-income ratio and risk score.
print("Rejected Dataset:")
[1] "Rejected Dataset:"
str(dfr) # Displaying structure of rejected dataset
Classes 'data.table' and 'data.frame': 5378466 obs. of 5 variables:
$ Debt-To-Income Ratio: chr "3.52%" "36.2%" "14.15%" "22.1%" ...
$ Risk_Score : num NA NA 653 NA NA NA NA 713 NA NA ...
$ Application Date : IDate, format: "2017-07-08" "2018-05-14" ...
$ Employment Length : chr "< 1 year" "< 1 year" "< 1 year" "< 1 year" ...
$ Amount Requested : num 1000 10000 20000 30000 1000 1000 35000 19000 10000 12000 ...
- attr(*, ".internal.selfref")=<externalptr>
These datasets are pivotal for analyzing the factors contributing to the approval or rejection of loan applications, helping us understand credit risk management and lending patterns during the specified time period.
How do the loan features of term, amount, and interest rate individually influence the likelihood of a loan being charged off versus being fully paid?
This question focuses on assessing risk levels and looking into trends of when the company can expect to see a loan charged off (lender has accepted the loan as unlikely to recover) or being paid off.
Assumptions: In terms of loan status, we only analyze the “Fully Paid” and “Charged Off” statuses.
For this question, we will be looking at funded amount, loan status, term, and interest rate.
Based on this data, funded amount and interest rate are numerical variables, while loan status and term have been converted to factors and are categorical. Loan status is naturally categorical; term has been converted to a factor because there are only two loan terms (36 or 60 months), which can be treated as short term versus long term when comparing the likelihood of default across loans. Term is therefore treated as a nonlinear (categorical) effect, while interest rate and funded amount are treated as affecting risk linearly.
To retrieve this information, data cleaning was done on the accepted loans dataset, which initially contained 2,157,894 observations. After filtering, 1,245,285 observations remain.
# Subset for specific columns
f_df <- df[, c("funded_amnt", "loan_status", "term", "int_rate")]
# Filter for specific loan statuses
df_filter <- f_df %>%
  filter(loan_status %in% c("Charged Off", "Fully Paid"))
rm(f_df)
# Convert loan status and term to factors
df_filter$loan_status <- as.factor(df_filter$loan_status)
df_filter$term <- as.factor(df_filter$term)
# Select relevant columns
df_final <- df_filter %>%
  select(funded_amnt, term, int_rate, loan_status)
rm(df_filter)
#Check the Dataframe output
str(df_final)
Classes 'data.table' and 'data.frame': 1245285 obs. of 4 variables:
$ funded_amnt: num 3600 24700 20000 10400 11950 ...
$ term : Factor w/ 2 levels "36 months","60 months": 1 1 2 2 1 1 1 1 1 1 ...
$ int_rate : num 14 12 10.8 22.4 13.4 ...
$ loan_status: Factor w/ 2 levels "Charged Off",..: 2 2 2 2 2 2 2 2 2 2 ...
- attr(*, ".internal.selfref")=<externalptr>
amount_plot = outlierKD2(df_final, funded_amnt, rm = TRUE, qqplt = TRUE, histogram = FALSE)
Outliers identified: 7157
Proportion (%) of outliers: 0.6
Mean of the outliers: 39937
Mean without removing outliers: 14558
Mean if we remove outliers: 14411
Outliers successfully removed
rm(amount_plot)
The loan amount data does not appear to follow a normal distribution based on the QQ plot: there are significant deviations from the reference line, particularly at the tails. The lower quantiles fall below the line, while the upper quantiles rise far above it, indicating heavy tails. Additionally, the stepped line in the middle suggests that the data are not evenly distributed. Even with the outliers removed, extreme deviations remain at both ends, so the distribution is not normal.
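For reference, a minimal base-R sketch of the 1.5 × IQR (boxplot) rule on simulated data, assuming outlierKD2 follows this standard fence-based definition of an outlier (an assumption about its internals, and hypothetical numbers):

```r
# Simulated right-skewed "loan amount" data with an injected cluster of high values
set.seed(1)
x <- c(rnorm(1000, mean = 14500, sd = 8000), rep(40000, 10))

# Fences at 1.5 * IQR beyond the quartiles
q     <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

outliers <- x[x < lower | x > upper]
length(outliers)                   # how many points fall outside the fences
mean(x)                            # mean with outliers included
mean(x[x >= lower & x <= upper])   # mean after removing them
```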
rate_plot = outlierKD2(df_final, int_rate, rm = TRUE, qqplt = TRUE, histogram = FALSE)
Outliers identified: 24826
Proportion (%) of outliers: 2
Mean of the outliers: 27.6
Mean without removing outliers: 13.3
Mean if we remove outliers: 13
Outliers successfully removed
rm(rate_plot)
The interest rate shows similar trends. There are major deviations from the reference line, especially in the upper tail, indicating that the interest rate distribution is not normal when outliers are present. With the outliers removed, the QQ plot appears much more linear, showing significant improvement toward normality, though some deviations remain in the tails.
Overall, neither loan amount nor interest rate is normally distributed when outliers are included. Removing outliers improves normality, but deviations persist at the extremes, so these variables still depart from normality.
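A numerical complement to the QQ plots is a formal normality test; note that shapiro.test() in base R accepts at most 5,000 observations, so large columns like these must be subsampled first. A sketch on simulated right-skewed data (hypothetical, not the LendingClub columns):

```r
set.seed(42)
skewed <- rexp(100000, rate = 1/14500)  # stand-in for a right-skewed amount column
sub <- sample(skewed, 5000)             # shapiro.test() is limited to 5,000 obs
shapiro.test(sub)$p.value               # tiny p-value -> reject normality
```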
Initial descriptive statistics summarize the data and can help identify patterns, trends, and outliers. To demonstrate this, loan status will be analyzed against each loan feature (term, amount, and interest rate) to look for relationships and patterns.
# Contingency Table (Association Analysis)
c_table <- table(df_final$term, df_final$loan_status)
cat("Contingency Table: Loan Status by Term\n\n")
Contingency Table: Loan Status by Term
print(c_table)
Charged Off Fully Paid
36 months 153426 789214
60 months 99769 202876
#Contingency Table by Proportions
prop_table <-prop.table(c_table) *100
cat("Contingency Table: Loan Status by Term Proportions (by %)\n\n")
Contingency Table: Loan Status by Term Proportions (by %)
print(prop_table)
Charged Off Fully Paid
36 months 12.32 63.38
60 months 8.01 16.29
rm(c_table)
# Stacked Bar Chart
ggplot(df_final, aes(x = term, fill = loan_status)) +
geom_bar(position = "stack") +
labs(x="Loan Term", y="Number of Loans",title="Stacked Bar Chart of Loan Status by Term" ) +
scale_fill_manual(values = c("navy", "lightblue"))
Via the first contingency table, 36-month loans have a higher absolute number of both fully paid and charged-off loans than 60-month loans. Note, however, that the proportions above are shares of all loans, not within-term rates: conditioning on term, 36-month loans charge off at roughly 16% (153,426 of 942,640), while 60-month loans charge off at roughly 33% (99,769 of 302,645).
The stacked bar chart shows the same pattern: 36-month loans dominate in volume, but a larger share of the 60-month bar is charged off.
Overall, longer loan terms are associated with a substantially higher charge-off rate, so term appears to influence loan status.
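Because prop.table() without a margin divides by the grand total, within-term charge-off rates are easier to read from row-conditional proportions (margin = 1). A sketch rebuilding the table from the counts reported above:

```r
# Contingency table rebuilt from the counts above
c_table <- matrix(c(153426, 789214,
                    99769,  202876),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Term   = c("36 months", "60 months"),
                                  Status = c("Charged Off", "Fully Paid")))

# Row-conditional proportions: each row sums to 100%
round(prop.table(c_table, margin = 1) * 100, 1)
# 36 months: 16.3% charged off; 60 months: 33.0% charged off
```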
# Descriptive Stats of Loan Amount Separated by Loan Status
amnt_stats <- df_final %>%
group_by(loan_status) %>%
summarise(
Mean = mean(funded_amnt, na.rm = TRUE),
Median = median(funded_amnt, na.rm = TRUE),
SD = sd(funded_amnt, na.rm = TRUE),
Q25 = quantile(funded_amnt, 0.25, na.rm = TRUE),
Q50 = quantile(funded_amnt, 0.50, na.rm = TRUE),
Q75 = quantile(funded_amnt, 0.75, na.rm = TRUE),
Min = min(funded_amnt, na.rm = TRUE),
Max = max(funded_amnt, na.rm = TRUE)
)
cat("Descriptive Stats of Loan Amounts Separated by Loan Status\n\n")
Descriptive Stats of Loan Amounts Separated by Loan Status
print(amnt_stats)
# A tibble: 2 × 9
loan_status Mean Median SD Q25 Q50 Q75 Min Max
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Charged Off 15666. 14400 8828. 9000 14400 20750 1000 40000
2 Fully Paid 14275. 12000 8722. 7500 12000 20000 1000 40000
rm(amnt_stats)
# Box Plot Comparing Loan Amounts by Loan Status
ggplot(df_final, aes(x = loan_status, y = funded_amnt, fill = loan_status)) +
geom_boxplot(colour = "black", outlier.colour = "blue", outlier.shape = 8, outlier.size = 4) +
labs(title="Boxplot of Loan Amounts by Loan Status", x="Loan Status", y = "Loan Amount ($)")+
scale_fill_manual(values = c("navy", "lightblue"))
Based on the statistics, charged-off loans have a higher mean and median loan amount, suggesting that larger loan amounts may be associated with a higher risk of being charged off. Charged-off loans also have a slightly higher standard deviation, indicating more variability in the amounts. Lastly, both loan statuses share the same minimum and maximum, but that does not necessarily mean the distributions are identical.
Based on the boxplot, one commonality is the cluster of outliers around $40,000 for both loan statuses. These high amounts are not necessarily driven by loan status, and further investigation into other contributing factors is warranted. Given the IQRs of both statuses, variability is similar; however, the IQR shows that charged-off loans sit at slightly higher loan amounts than fully paid loans.
Overall, larger loan amounts may be associated with a higher charged off risk. However holistically, loan amount alone does not appear to have a strong influence on whether a loan will be charged off or fully paid, given overlapping distributions.
# Descriptive Stats of Interest Rate Separated by Loan Status
rate_stats <- df_final %>%
group_by(loan_status) %>%
summarise(
Mean = mean(int_rate, na.rm = TRUE),
Median = median(int_rate, na.rm = TRUE),
SD = sd(int_rate, na.rm = TRUE),
Q25 = quantile(int_rate, 0.25, na.rm = TRUE),
Q50 = quantile(int_rate, 0.50, na.rm = TRUE),
Q75 = quantile(int_rate, 0.75, na.rm = TRUE),
Min = min(int_rate, na.rm = TRUE),
Max = max(int_rate, na.rm = TRUE)
)
cat("Descriptive Stats of Interest Rates Separated by Loan Status\n\n")
Descriptive Stats of Interest Rates Separated by Loan Status
print(rate_stats)
# A tibble: 2 × 9
loan_status Mean Median SD Q25 Q50 Q75 Min Max
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Charged Off 15.8 15.0 4.95 12.3 15.0 18.6 5.31 31.0
2 Fully Paid 12.6 12.1 4.56 9.17 12.1 15.2 5.31 31.0
rm(rate_stats)
df_count <-df_final %>%
filter(int_rate >0) %>%
mutate(int_rate_bin = floor(int_rate/2)*2) %>%
filter(int_rate_bin >0) %>%
count(int_rate_bin, loan_status)
ggplot(df_count, aes(x = int_rate_bin, y = n, color = loan_status)) +
geom_line(size = 1)+
labs(x="Interest Rate (%)", y="Count", title = "Distribution of Interest Rates by Loan Status") +
scale_color_manual(values = c("navy", "lightblue"))
rm(df_count)
rm(df_final)
Based on the statistics, higher interest rates are associated with a substantially greater likelihood of a loan being charged off. Charged-off loans have a mean interest rate of 15.8%, compared to 12.6% for fully paid loans, with a similar difference in the medians (15.0% vs. 12.1%). Even at the 25th percentile, charged-off loans carry higher rates (12.3% vs. 9.17%). Charged-off loans also have a higher standard deviation and a higher 75th percentile, suggesting greater volatility and risk. This indicates that interest rate is a strong predictor of loan charge-offs.
Consistent with the statistics, the line plot shows interest rates having a strong influence on loan outcomes. Loans with interest rates of roughly 15% and above are charged off at a higher rate relative to fully paid loans, indicating greater default risk at these levels. In contrast, rates between 10% and 12% are associated with a higher frequency of fully paid loans, suggesting that borrowers with lower rates are more likely to repay in full. There is noticeable overlap in the 10% to 15% range, where both outcomes occur, but as interest rates increase, the likelihood of a charge-off rises markedly. This again points to interest rate as a key predictor of loan performance, with higher rates correlating strongly with defaults and lower rates with successful repayment.
Overall, loans with higher interest rates are likelier to be charged off, making interest rate a strong influence on loan charge-offs.
columns_to_target <- c("annual_inc", "dti", "fico_range_high", "fico_range_low", "loan_amnt",
                       "int_rate", "earliest_cr_line", "revol_util", "delinq_2yrs", "pub_rec",
                       "total_acc", "open_acc", "installment", "home_ownership",
                       "verification_status", "delinq_amnt", "collections_12_mths_ex_med",
                       "chargeoff_within_12_mths", "mths_since_last_delinq", "purpose",
                       "sub_grade", "issue_d", "addr_state", "inq_last_12m", "loan_status")
#df <- fetch_subset(df_location = df_file_path, col_names = columns_to_target, nrows = c(1,-1))
df_final <- df %>% select(all_of(columns_to_target))
convert_to_factors <- function(df) {
  # Columns to convert to factors
  factor_cols <- c("home_ownership", "verification_status",
                   "loan_status", "purpose", "addr_state", "sub_grade")
  df[] <- lapply(names(df), function(col) {
    if (col %in% factor_cols) {
      as.factor(df[[col]])  # Convert categorical columns to factors
    } else if (is.character(df[[col]]) &&
               all(!is.na(suppressWarnings(as.numeric(df[[col]][-1]))))) {
      as.numeric(df[[col]])  # Convert numeric-like strings to numeric
    } else {
      df[[col]]  # Leave other columns unchanged
    }
  })
  return(df)
}
df_final <- df_final[-1, ] # Drop the first row
df_final <- convert_to_factors(df_final)
#str(df_final)
# Convert specified columns to numeric
df_final$revol_util <- as.numeric(df_final$revol_util)
df_final$mths_since_last_delinq <- as.numeric(df_final$mths_since_last_delinq)
df_final$dti <- as.numeric(df_final$dti)
df_final$inq_last_12m <- as.numeric(df_final$inq_last_12m)
df_final$chargeoff_within_12_mths <- as.numeric(df_final$chargeoff_within_12_mths)
# Optional: Check for any warnings about NAs being introduced during coercion
if (any(is.na(df_final$revol_util))) {
warning("NAs introduced in 'revol_util' during conversion.")
}
if (any(is.na(df_final$mths_since_last_delinq))) {
warning("NAs introduced in 'mths_since_last_delinq' during conversion.")
}
if (any(is.na(df_final$dti))) {
warning("NAs introduced in 'dti' during conversion.")
}
df_final$issue_d <- as.Date(df_final$issue_d, format = "%Y-%m-%d")
How do variables such as dti, annual_inc, int_rate, and fico_range impact the likelihood of a loan being fully paid or charged off year over year from 2013 to 2018? What trends can be identified in the data, and how do these variables differ between the two loan statuses over time?
In this analysis, we aim to explore how key financial variables such as debt-to-income ratio (DTI), annual income, interest rate, and FICO score range affect the likelihood of a loan being fully paid or charged off between 2013 and 2018. By analyzing these variables over time, we can identify trends and patterns that may influence loan outcomes. Specifically, we seek to understand how these factors differ between loans that were successfully repaid versus those that were charged off, shedding light on risk factors and lending dynamics during this period.
In this section, we analyze the percentage of loans that were charged off (i.e., loans that borrowers were unable to fully repay) between 2013 and 2018. By aggregating loan data on a quarterly basis, we examine the trends in charged-off loans over time to gain insights into changes in lending risk and borrower behavior. Specifically, we explore how the charged-off loan percentage fluctuated over the years and consider potential factors that could explain these changes.
# LINE PLOT
# Create a copy of the dataframe
df1 <- df_final
# Step 1: Ensure 'issue_d' is in Date format
df1$issue_d <- as.Date(df1$issue_d, format = "%Y-%m-%d") # Adjust the format if necessary
# Step 2: Extract year and quarter from 'issue_d'
df1 <- df1 %>%
mutate(year = year(issue_d),
quarter = quarter(issue_d))
# Step 3: Filter out rows with NA values in year or quarter
df1 <- df1 %>%
filter(!is.na(year), !is.na(quarter))
# Step 4: Calculate the total number of loans per quarter
df_total_per_quarter <- df1 %>%
group_by(year, quarter) %>%
summarise(total_loans = n(), .groups = 'drop') # Adding .groups = 'drop' to avoid warnings
# Step 5: Filter for 'Charged Off' loan status
df_charged_off <- df1 %>%
filter(loan_status == "Charged Off")
# Step 6: Aggregate 'Charged Off' counts by year and quarter
df_charged_off_aggregated <- df_charged_off %>%
group_by(year, quarter) %>%
summarise(charged_off_count = n(), .groups = 'drop') # Adding .groups = 'drop' to avoid warnings
# Step 7: Merge the total loan counts with the 'Charged Off' counts
df_merged <- df_charged_off_aggregated %>%
left_join(df_total_per_quarter, by = c("year", "quarter")) %>%
mutate(percentage = (charged_off_count / total_loans) * 100)
# Step 8: Create a Year-Quarter variable in the correct format and order
df_merged <- df_merged %>%
mutate(Year_Quarter = paste0(year, ".", quarter)) %>%
arrange(year, quarter) # Arrange the data to ensure it's in the correct order
# Step 9: Convert Year_Quarter into a factor with levels in the correct chronological order
df_merged$Year_Quarter <- factor(df_merged$Year_Quarter,
levels = df_merged$Year_Quarter)
# Step 10: Create the line plot with percentage and display percentage as text on each point
ggplot(df_merged, aes(x = Year_Quarter, y = percentage, group = 1)) +
geom_area(fill = "red", alpha = 0.3) + # Fill the area under the line with color and adjust opacity
geom_line(color = "red", size = 1) + # Line color and size
geom_point(color = "red", size = 2) + # Point color and size
geom_text(aes(label = sprintf("%.1f%%", percentage)),
vjust = -0.5, size = 3, nudge_y = 0.5) + # Display percentage above points
labs(x = "Year-Quarter", y = "Charged Off Percentage", title = "Percentage of Charged Off Loans by Year and Quarter") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1), # Rotate x-axis labels
panel.grid.major.y = element_blank(), # Remove horizontal grid lines
panel.grid.minor.y = element_blank()) # Remove minor horizontal grid lines
rm(df_charged_off)
rm(df1)
rm(df_charged_off_aggregated)
rm(df_merged)
rm(df_total_per_quarter)
The charged-off percentage increases from 14.9% in Q1 2013 to a peak of 18.5% in Q2 2015, then declines steadily through Q4 2018. Part of this decline likely reflects right-censoring: loans issued in later quarters are still active (“Current”), so fewer of them have had time to reach a charged-off status. Changes in underwriting standards over the period could also contribute.
In this section, we conduct hypothesis testing on key financial features to examine their influence on loan outcomes, specifically the likelihood of a loan being charged off. Using Welch’s t-test, we compare loans issued before and after Q2 2015 to identify significant differences in features such as annual income, debt-to-income ratio, interest rate, FICO score, and other variables that may impact loan status.
For each feature, the null hypothesis (H0) is that the feature’s mean is the same for charged-off loans issued before and after Q2 2015, and the alternative hypothesis (H1) is that the means differ.
Based on the p-values obtained from the t-tests, we determine whether to reject the null hypothesis in favor of the alternative hypothesis. The results help us identify which features play a significant role in determining loan outcomes over time.
# Step 1: Verify and convert 'issue_d' to Date if necessary
df_final$issue_d <- as.Date(df_final$issue_d, format = "%Y-%m-%d") # Adjust the format if necessary
# Step 2: Filter the dataframe for 'Charged Off' loan_status and remove NaN values
charged_off_df <- df_final %>%
filter(loan_status == "Charged Off") %>%
filter(!is.na(issue_d)) # Ensure we have dates to work with
# Step 3: Create a new column for quarters if not already created
charged_off_df <- charged_off_df %>%
mutate(quarter = quarter(issue_d),
year = year(issue_d),
quarter_year = as.integer(format(issue_d, "%Y")) * 10 + ceiling(as.integer(format(issue_d, "%m")) / 3))
# Step 4: Check unique quarters available in the dataset
unique_quarters <- unique(charged_off_df$quarter_year)
#cat("Unique quarters in the dataset:", unique_quarters, "\n")
# Step 5: Split the data into two groups: before and including Q2 2015, and after Q2 2015
before_2015_Q2 <- charged_off_df %>%
filter(year < 2015 | (year == 2015 & quarter <= 2))
after_2015_Q2 <- charged_off_df %>%
filter(year > 2015 | (year == 2015 & quarter > 2))
# Proceed only if there are records in both groups
if (nrow(before_2015_Q2) > 0 && nrow(after_2015_Q2) > 0) {
# Step 6: Identify numerical features in the filtered dataframe
numerical_cols <- before_2015_Q2 %>%
select_if(is.numeric) %>%
colnames()
# Step 7: Perform T-test for each numeric column, excluding those with zero observations in either group
alpha <- 0.05 # Significance level
t_test_results <- lapply(numerical_cols, function(col) {
# Extract the two groups
group1 <- before_2015_Q2[[col]]
group2 <- after_2015_Q2[[col]]
# Check if both groups have enough non-NA observations
if (length(na.omit(group1)) > 0 && length(na.omit(group2)) > 0) {
# Perform the t-test
t_test_result <- t.test(group1, group2, var.equal = FALSE) # Welch's t-test
# Determine hypothesis status
hypothesis_status <- ifelse(t_test_result$p.value <= alpha, "Reject H0", "Fail to Reject H0")
return(data.frame(Feature = col, T_value = t_test_result$statistic, P_value = t_test_result$p.value, Hypothesis_Status = hypothesis_status))
} else {
return(NULL) # Return NULL for insufficient data
}
})
# Step 8: Remove NULL results and combine results into a dataframe
t_test_results_df <- do.call(rbind, Filter(Negate(is.null), t_test_results))
# Step 9: Remove unwanted features
t_test_results_filtered <- t_test_results_df %>%
filter(!Feature %in% c("quarter", "year", "quarter_year"))
# Step 10: Print the filtered results
print(t_test_results_filtered)
} else {
cat("Not enough records in one of the groups to perform t-tests.\n")
}
Feature T_value P_value Hypothesis_Status
t annual_inc -14.29 2.66e-46 Reject H0
t1 dti -20.26 3.66e-91 Reject H0
t2 fico_range_high -21.90 3.33e-106 Reject H0
t3 fico_range_low -21.90 3.32e-106 Reject H0
t4 loan_amnt -1.84 6.65e-02 Fail to Reject H0
t5 int_rate 7.60 2.99e-14 Reject H0
t6 revol_util 65.33 0.00e+00 Reject H0
t7 delinq_2yrs -6.79 1.11e-11 Reject H0
t8 pub_rec -18.93 7.76e-80 Reject H0
t9 total_acc 21.91 2.49e-106 Reject H0
t10 open_acc -6.28 3.30e-10 Reject H0
t11 installment -16.88 6.75e-64 Reject H0
t12 delinq_amnt -4.22 2.42e-05 Reject H0
t13 collections_12_mths_ex_med -15.26 1.42e-52 Reject H0
t14 chargeoff_within_12_mths -2.94 3.32e-03 Reject H0
t15 mths_since_last_delinq 1.18 2.39e-01 Fail to Reject H0
rm(after_2015_Q2)
rm(before_2015_Q2)
rm(charged_off_df)
rm(t_test_results)
rm(t_test_results_df)
rm(t_test_results_filtered)
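Since roughly sixteen t-tests are run simultaneously, it may be worth adjusting the p-values for multiple comparisons (for example, with the Benjamini-Hochberg procedure via base R’s p.adjust). A sketch using a few of the p-values reported above:

```r
# A subset of the p-values from the t-test table above
p_values <- c(annual_inc = 2.66e-46, dti = 3.66e-91,
              loan_amnt = 6.65e-02, mths_since_last_delinq = 2.39e-01)

# Benjamini-Hochberg adjustment controls the false discovery rate
p_adj <- p.adjust(p_values, method = "BH")
names(p_adj)[p_adj <= 0.05]   # -> "annual_inc" "dti"
```

With adjustment, the borderline features remain non-significant while the extreme p-values are unaffected in practical terms.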
In this section, we analyze the relationships between key financial variables using a correlation matrix. The correlation plot helps identify how different variables, such as income, loan amount, credit score, interest rate, and other features, interact with each other. By understanding these correlations, we can gain insights into how specific factors, like income or credit score, may influence loan characteristics and outcomes.
The correlation matrix is visualized to highlight both positive and negative relationships between variables, offering a comprehensive overview of how these financial features are interconnected and how they might affect borrower behavior or loan performance.
#COOR PLOT
setDT(df_final) # Converts df to a data.table, if it isn't one already
# Select only numeric columns from the data table
numeric_df <- df_final[, .SD, .SDcols = sapply(df_final, is.numeric)]
# Calculate the correlation matrix
cor_matrix <- cor(numeric_df, use = "pairwise.complete.obs")
# Set plotting parameters for larger plot size
# Create a larger plot window
options(repr.plot.width = 35, repr.plot.height = 45) # Use this in RMarkdown or Jupyter Notebooks
# Plot the correlation matrix using corrplot without the title
corrplot(cor_matrix,
type = "upper",
method = "circle",
addCoef.col = "black", # Color of the correlation coefficients
number.cex = 0.1, # Size of the correlation coefficients
tl.cex = 0.6, # Size of the text labels for variables
tl.col = "black", # Color of the text labels
cl.cex = 0.5) # Size of the color legend text
# Remove numeric_df
rm(numeric_df)
rm(cor_matrix)
annual_inc has a weak positive correlation with loan_amnt (0.1949) and installment (0.188), suggesting that higher-income individuals tend to take larger loans with higher installments.
dti has a weak positive correlation with int_rate (0.1239), indicating slightly higher interest rates for borrowers with higher debt-to-income ratios.
fico_range_high and fico_range_low have moderate negative correlations with int_rate (-0.4049), indicating that better credit scores are associated with lower interest rates.
fico_range_high and fico_range_low are negatively correlated with revol_util (-0.4747), suggesting that individuals with higher credit scores tend to have lower revolving credit utilization.
delinq_2yrs has a strong negative correlation with mths_since_last_delinq (-0.5524): borrowers with more delinquencies in the past two years tend to have had a delinquency more recently.
pub_rec shows a weak negative correlation with loan_amnt (-0.0628), suggesting slightly smaller loan amounts for borrowers with more derogatory public records.
total_acc and open_acc are highly correlated (0.7199), which is expected since both count accounts on the credit report.
total_acc has a weak positive correlation with loan_amnt (0.1966), indicating that borrowers with more accounts tend to have slightly higher loan amounts.
In this section, we examine the geographical distribution of charged-off loans across U.S. states, focusing on changes in loan performance before and after Q2 2015. By analyzing the percentage of charged-off loans by state, we aim to identify regional trends and assess the impact of interventions implemented during this period. Using geographic visualization, we compare the frequency of charged-off loans across different states to uncover potential regional disparities in loan outcomes, and evaluate the success of interventions in improving loan performance in specific areas.
# US MAP
# Load the state coordinates
state_coords <- read.csv("../DataSet/states.csv")
# Step 1: Verify and convert 'issue_d' to Date if necessary
df_final$issue_d <- as.Date(df_final$issue_d, format = "%Y-%m-%d") # Adjust the format if necessary
# Step 2: Filter the dataframe and remove NaN values
charged_off_df <- df_final %>%
filter(!is.na(issue_d), !is.na(addr_state)) # Ensure we have dates and states to work with
# Step 3: Create a new column for quarters
charged_off_df <- charged_off_df %>%
mutate(quarter_year = as.integer(format(issue_d, "%Y")) * 10 + ceiling(as.integer(format(issue_d, "%m")) / 3))
# Step 4: Split the data into two groups: before and including Q2 2015, and after Q2 2015
before_2015_Q2 <- charged_off_df %>%
filter(quarter_year <= 20152)
after_2015_Q2 <- charged_off_df %>%
filter(quarter_year > 20152)
# Step 5: Calculate total loans and charged off loans by state for both periods
total_loans_before <- before_2015_Q2 %>%
group_by(addr_state) %>%
summarise(total_loans = n(), .groups = "drop")
charged_off_loans_before <- before_2015_Q2 %>%
filter(loan_status == "Charged Off") %>%
group_by(addr_state) %>%
summarise(charged_off_loans = n(), .groups = "drop")
total_loans_after <- after_2015_Q2 %>%
group_by(addr_state) %>%
summarise(total_loans = n(), .groups = "drop")
charged_off_loans_after <- after_2015_Q2 %>%
filter(loan_status == "Charged Off") %>%
group_by(addr_state) %>%
summarise(charged_off_loans = n(), .groups = "drop")
# Step 6: Merge totals and charged off loans for both periods
frequency_data_before <- total_loans_before %>%
left_join(charged_off_loans_before, by = "addr_state") %>%
mutate(charged_off_loans = ifelse(is.na(charged_off_loans), 0, charged_off_loans),
percentage_charged_off = (charged_off_loans / total_loans) * 100)
frequency_data_after <- total_loans_after %>%
left_join(charged_off_loans_after, by = "addr_state") %>%
mutate(charged_off_loans = ifelse(is.na(charged_off_loans), 0, charged_off_loans),
percentage_charged_off = (charged_off_loans / total_loans) * 100)
# Step 7: Merge both frequency data with state coordinates
frequency_data <- state_coords %>%
left_join(frequency_data_before %>% select(addr_state, percentage_charged_off) %>% rename(percentage_charged_off_before = percentage_charged_off),
by = c("state" = "addr_state")) %>%
left_join(frequency_data_after %>% select(addr_state, percentage_charged_off) %>% rename(percentage_charged_off_after = percentage_charged_off),
by = c("state" = "addr_state"))
# Step 8: Filter to keep only points within the contiguous US
frequency_data <- frequency_data %>%
filter(latitude >= 24.396308 & latitude <= 49.384358 &
longitude >= -125.0 & longitude <= -66.93457)
# Step 9: Create the map
us_map <- map_data("state")
# Step 10: Plot with state symbols
ggplot() +
# Base map
geom_polygon(data = us_map, aes(x = long, y = lat, group = group), fill = "lightgrey") +
# Points for charged off loans before Q2 2015
geom_point(data = frequency_data,
aes(x = longitude, y = latitude,
size = percentage_charged_off_before,
color = "Before Q2 2015"),
alpha = 0.5) +
# Points for charged off loans after Q2 2015
geom_point(data = frequency_data,
aes(x = longitude, y = latitude,
size = percentage_charged_off_after,
color = "After Q2 2015"),
alpha = 0.5) +
# Add state abbreviations
geom_text(data = frequency_data, aes(x = longitude, y = latitude, label = state),
color = "black", size = 3, vjust = -0.8) + # Adjust text size and position
# Adjust the size scale
scale_size_continuous(
range = c(3, 10), # Adjusting the size range
name = "% of Charged Off Loans"
) +
# Define color scale
scale_color_manual(values = c("Before Q2 2015" = "red", "After Q2 2015" = "green"),
name = "Time Period") +
# Labels and title
labs(title = "% of Charged Off Loans by State (Before and After Q2 2015)",
x = "Longitude",
y = "Latitude") +
theme_minimal() +
theme(legend.position = "top") # Place legend at the top
rm(after_2015_Q2)
rm(before_2015_Q2)
rm(charged_off_df)
rm(charged_off_loans_after)
rm(charged_off_loans_before)
rm(us_map)
rm(frequency_data)
rm(frequency_data_after)
rm(state_coords)
rm(total_loans_after)
rm(total_loans_before)
rm(frequency_data_before)
The analysis highlights states like Nebraska and Iowa as significant cases with varying outcomes in charged-off loans. The successful interventions in Nebraska indicate that targeted strategies can yield substantial improvements in loan performance.
This section focuses on the debt-to-income (DTI) ratio and its relationship with loan outcomes, specifically comparing charged-off and fully paid loans between 2013 and 2018. The DTI ratio, a key metric used to assess borrowers’ ability to manage loan repayments, is analyzed to identify its role as a potential risk factor for loan defaults. By visualizing DTI distributions across different quarters using box plots, we highlight differences between charged-off and fully paid loans and explore trends over time, providing insights into how DTI may influence loan performance.
# BOX PLOT
df1 <- df_final
# Ensure issue_d is in Date format
df1$issue_d <- as.Date(df1$issue_d, format = "%b-%Y")
# Filter out rows where issue_d is NA
df1 <- df1 %>%
filter(!is.na(issue_d))
# Extract year and quarter from issue_d
df1$quarter <- paste0("Q", quarter(df1$issue_d), " ", year(df1$issue_d))
# Filter the data for only "Charged Off" and "Fully Paid" loan statuses
df_filtered <- df1 %>%
filter(loan_status %in% c("Charged Off", "Fully Paid"))
# Define a function to remove outliers based on IQR
remove_outliers <- function(data) {
Q1 <- quantile(data$dti, 0.25, na.rm = TRUE)
Q3 <- quantile(data$dti, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
data %>%
filter(dti >= lower_bound & dti <= upper_bound)
}
# Remove outliers by applying the function to each quarter and loan_status group
df_no_outliers <- df_filtered %>%
group_by(quarter, loan_status) %>%
do(remove_outliers(.))
# Ensure proper order of quarters
df_no_outliers$quarter <- factor(df_no_outliers$quarter,
levels = unique(df_no_outliers$quarter[order(df_no_outliers$issue_d)]))
# Plot box plot for dti for each quarter with loan_status distinction (without outliers)
ggplot(df_no_outliers, aes(x = quarter, y = dti, fill = loan_status)) +
geom_boxplot() +
labs(title = "DTI Box Plot by Quarter for Charged Off and Fully Paid Loans",
x = "Quarter",
y = "DTI",
fill = "Loan Status") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for readability
rm(df_filtered)
rm(df1)
rm(df_no_outliers)
The analysis indicates that higher DTI ratios are associated with charged-off loans, suggesting a potential risk factor for loan defaults. The observed trends in DTI medians may warrant further investigation to understand the underlying causes.
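The association described above can also be tested formally with a two-sample Welch t-test of DTI by loan status. The sketch below uses synthetic data so it runs standalone; in the full analysis the inputs would be `dti` and `loan_status` from `df_final`.

```r
# Welch two-sample t-test of DTI by loan outcome (synthetic data for illustration)
set.seed(42)
toy <- data.frame(
  dti = c(rnorm(500, mean = 18, sd = 6),    # fully paid: lower DTI on average
          rnorm(500, mean = 21, sd = 6)),   # charged off: higher DTI on average
  loan_status = rep(c("Fully Paid", "Charged Off"), each = 500)
)
res <- t.test(dti ~ loan_status, data = toy)  # Welch correction is the default
cat(sprintf("t = %.2f, p = %.3g\n", res$statistic, res$p.value))
```

Because factor levels sort alphabetically, "Charged Off" is the first group, so a significant positive t-statistic corresponds to charged-off loans having the higher mean DTI, matching the pattern in the box plots.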
In this section, we use radar charts to compare key loan characteristics between loans issued in 2014 and 2017. By standardizing variables such as interest rates, annual income, FICO score range, and debt-to-income (DTI) ratio, we visualize changes in borrower profiles and loan terms over time. This analysis highlights shifts in loan affordability, borrower financial health, and creditworthiness, offering insights into how lending practices evolved during this period.
# RADAR CHART
# Step 1: Filter DataFrames for loans issued in 2014 and 2017
df_2014 <- df_final[df_final$issue_d >= as.Date("2014-01-01") & df_final$issue_d <= as.Date("2014-12-31"), ]
df_2017 <- df_final[df_final$issue_d >= as.Date("2017-01-01") & df_final$issue_d <= as.Date("2017-12-31"), ]
# Step 2: Function to remove outliers using IQR
# Function to remove outliers using IQR method
remove_outliers <- function(df, columns) {
for (col in columns) {
if (col %in% names(df)) {
Q1 <- quantile(df[[col]], 0.25, na.rm = TRUE)
Q3 <- quantile(df[[col]], 0.75, na.rm = TRUE)
IQR_value <- IQR(df[[col]], na.rm = TRUE)
# Define lower and upper bounds
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
# Filter out the outliers
df <- df[df[[col]] >= lower_bound & df[[col]] <= upper_bound, ]
} else {
message(paste("Column", col, "not found in the dataframe. Skipping."))
}
}
return(df)
}
# Step 3: Select the columns of interest for standardization and outlier removal
selected_columns <- c("int_rate", "annual_inc", "fico_range_low", "fico_range_high", "dti")
# Remove outliers in each group
df_2014 <- remove_outliers(df_2014, selected_columns)
df_2017 <- remove_outliers(df_2017, selected_columns)
# Step 4: Function to standardize numeric columns (scaling from 0 to 1)
# Convert df_2014 to a data.table
setDT(df_2014)
setDT(df_2017)
# Define the standardization function using data.table syntax
standardize_columns <- function(dt, selected_cols) {
dt[, (selected_cols) := lapply(.SD, function(x) (x - min(x, na.rm = TRUE)) /
(max(x, na.rm = TRUE) - min(x, na.rm = TRUE))),
.SDcols = selected_cols]
return(dt)
}
# Standardize the selected columns in each group
df_2014 <- standardize_columns(df_2014, selected_columns)
df_2017 <- standardize_columns(df_2017, selected_columns)
# Step 5: Calculate the medians for selected numeric columns
medians_2014 <- df_2014[, lapply(.SD, median, na.rm = TRUE), .SDcols = selected_columns]
medians_2017 <- df_2017[, lapply(.SD, median, na.rm = TRUE), .SDcols = selected_columns]
# Step 6: Combine the medians into a data frame for radar chart
radar_data_2014_2017 <- rbind(as.data.frame(medians_2014),
as.data.frame(medians_2017))
# Step 7: Assign appropriate row names for each category
rownames(radar_data_2014_2017) <- c("2014", "2017")
# Step 8: Add max and min rows for radar chart
radar_data_2014_2017 <- rbind(max = rep(1, ncol(radar_data_2014_2017)),
min = rep(0, ncol(radar_data_2014_2017)),
radar_data_2014_2017)
# Set the background color to white
par(bg = "white")
# Step 9: Plot the radar chart for 2014 and 2017
radarchart(radar_data_2014_2017,
axistype = 1, # Set axis type
pcol = c("#1E90FF", "#32CD32"), # Brighter line colors for 2014 and 2017
plty = 1, # Line type
title = "Comparison of Loan Variables (2014 vs. 2017)",
cglcol = "lightgrey", # Lighter grid line color
cglty = 1, # Type of the grid lines
caxislabels = seq(0, 1, 0.1), # Customize axis labels (0 to 1 range)
axislabcol = "black", # Axis label color
vlcex = 0.8, # Text size of labels
titlecol = "black", # Title color
cglwd = 0.8) # Thickness of the grid lines
# Step 10: Add legend to the radar chart
legend("topright", # Position of the legend
legend = c("2014", "2017"), # Labels for the legend
col = c("#1E90FF", "#32CD32"), # Corresponding bright colors
lty = 1, # Line type in the legend
bty = "n", # No box around the legend
text.col = "black", # Text color for the legend
cex = 0.8) # Text size of the legend
rm(df_2014)
rm(df_2017)
rm(medians_2014)
rm(medians_2017)
rm(radar_data_2014_2017)
This section examines the relationship between risk scores and debt-to-income (DTI) ratios for rejected loan applications, utilizing contour plots to visualize the distributions before and after 2015. By analyzing these trends, we aim to identify shifts in lending criteria and applicant profiles over time. The findings will provide insights into how lenders may have adjusted their risk assessment practices, reflecting broader economic changes and evolving borrower risk profiles.
# Contour PLOT
# Clone the original dataframe
dfr1 <- dfr
# Convert Application Date to Date format if not already done
dfr1$Application_Date <- as.Date(dfr1$'Application Date')
# Convert Risk_Score to numeric, handling NAs
dfr1$Risk_Score <- as.numeric(dfr1$Risk_Score)
# Convert Debt-To-Income Ratio to numeric by removing "%" and converting to decimal
dfr1$Debt_To_Income_Ratio <- as.numeric(gsub("%", "", dfr1$`Debt-To-Income Ratio`)) / 100
# Filter for application dates before and after 2015
dfr_before_2015 <- dfr1 %>% filter(Application_Date < as.Date("2015-01-01"))
dfr_after_2015 <- dfr1 %>% filter(Application_Date >= as.Date("2015-01-01"))
# Define outlier removal function using IQR
remove_outliers <- function(data) {
# Calculate Q1, Q3, and IQR for Risk_Score
Q1_risk <- quantile(data$Risk_Score, 0.25, na.rm = TRUE)
Q3_risk <- quantile(data$Risk_Score, 0.75, na.rm = TRUE)
IQR_risk <- IQR(data$Risk_Score, na.rm = TRUE)
# Calculate Q1, Q3, and IQR for Debt_To_Income_Ratio
Q1_dti <- quantile(data$Debt_To_Income_Ratio, 0.25, na.rm = TRUE)
Q3_dti <- quantile(data$Debt_To_Income_Ratio, 0.75, na.rm = TRUE)
IQR_dti <- IQR(data$Debt_To_Income_Ratio, na.rm = TRUE)
# Filter data to remove outliers
data_filtered <- data %>%
filter(
Risk_Score >= (Q1_risk - 1.5 * IQR_risk) & Risk_Score <= (Q3_risk + 1.5 * IQR_risk) &
Debt_To_Income_Ratio >= (Q1_dti - 1.5 * IQR_dti) & Debt_To_Income_Ratio <= (Q3_dti + 1.5 * IQR_dti)
)
return(data_filtered)
}
# Remove outliers from both datasets
dfr_before_2015_clean <- remove_outliers(dfr_before_2015)
dfr_after_2015_clean <- remove_outliers(dfr_after_2015)
# Combine the datasets for plotting
dfr_combined <- bind_rows(
dfr_before_2015_clean %>% mutate(Period = "Before 2015"),
dfr_after_2015_clean %>% mutate(Period = "After 2015")
)
# Create a combined contour plot
ggplot(dfr_combined, aes(x = Risk_Score, y = Debt_To_Income_Ratio, color = Period)) +
geom_density_2d() +
labs(title = "Contour Plot of Risk Score vs Debt-To-Income Ratio",
x = "Risk Score",
y = "Debt-To-Income Ratio") +
scale_color_manual(values = c("Before 2015" = "blue", "After 2015" = "red")) +
theme_minimal() +
theme(legend.title = element_blank())
# Clean up
rm(dfr_before_2015)
rm(dfr_after_2015)
rm(dfr1)
rm(dfr_before_2015_clean)
rm(dfr_after_2015_clean)
General Distribution: The contour plot shows distinct areas of concentration for rejected loan applications across risk scores and debt-to-income (DTI) ratios. For both periods, the highest concentration lies around DTI ratios of 0.2 to 0.3 and risk scores between 600 and 700, indicating that a significant share of rejections occurred in this range.
Before 2015 vs. After 2015: The red contours (After 2015) shift noticeably toward higher DTI ratios (around 0.4 to 0.6) at various risk levels compared to the blue contours (Before 2015). This suggests that after 2015, rejected applicants tended to carry higher debt burdens relative to their income, which could reflect a more leveraged applicant pool, tighter lending standards, or both.
Implication on Risk: The increased spread in both risk scores and DTI ratios for rejected loans after 2015 may indicate changes in risk assessment and rejection criteria. Lenders might have shifted their focus, possibly in response to economic changes post-2015, leading to the rejection of both higher-risk and lower-risk applicants with varying debt profiles.
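The "increased spread" reading can be checked numerically by comparing the dispersion of DTI before and after 2015, for example with an F-test of equal variances. This is a hedged sketch on synthetic data; in the report the inputs would be the cleaned before/after data frames built above.

```r
# Compare dispersion of DTI before vs. after 2015 (synthetic data for illustration)
set.seed(7)
dti_before <- rnorm(1000, mean = 0.25, sd = 0.08)  # tighter pre-2015 distribution
dti_after  <- rnorm(1000, mean = 0.30, sd = 0.12)  # wider post-2015 distribution
cat(sprintf("SD before: %.3f, SD after: %.3f\n", sd(dti_before), sd(dti_after)))
ftest <- var.test(dti_after, dti_before)  # F-test for the ratio of variances
cat(sprintf("F = %.2f, p = %.3g\n", ftest$statistic, ftest$p.value))
```

A significantly larger post-2015 variance would support the visual impression of a wider rejection profile, though the F-test assumes approximate normality, so a robust alternative (e.g. comparing IQRs) may be preferable on the real data.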
In this section, we conduct a histogram analysis of debt-to-income (DTI) ratios for rejected loans, focusing on data from before and after 2015. By visualizing the frequency distribution of DTI ratios, we aim to uncover trends in lending practices and the criteria used for loan rejection over time.
The histogram will reveal how the distribution of DTI values has shifted, highlighting any significant changes in rejection patterns associated with varying DTI levels. This analysis will provide insights into the evolving risk assessment strategies employed by lenders, particularly in response to economic conditions and regulatory changes post-2015.
# HISTOGRAM
# Load necessary libraries
library(ggplot2)
library(dplyr)
dfr1 <- dfr
# Ensure the Debt-To-Income Ratio is numeric
dfr1$dti <- as.numeric(gsub("%", "", dfr1$`Debt-To-Income Ratio`)) # Remove '%' and convert to numeric
# Ensure the Application Date is in Date format
dfr1$`Application Date` <- as.Date(dfr1$`Application Date`, format = "%Y-%m-%d")
# Remove NA values (if any) in Debt-To-Income Ratio and Application Date
dfr1 <- dfr1 %>% filter(!is.na(dti) & !is.na(`Application Date`))
# Define a function to remove outliers based on IQR
remove_outliers <- function(data) {
Q1 <- quantile(data$dti, 0.25, na.rm = TRUE) # 1st quartile (25th percentile)
Q3 <- quantile(data$dti, 0.75, na.rm = TRUE) # 3rd quartile (75th percentile)
IQR <- Q3 - Q1 # Interquartile range
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
data %>%
filter(dti >= lower_bound & dti <= upper_bound) # Remove outliers
}
# Apply the function to remove outliers
dfr_no_outliers <- remove_outliers(dfr1)
# Split the data into before and after 2015
dfr_before_2015 <- dfr_no_outliers %>% filter(`Application Date` < as.Date("2015-01-01"))
dfr_after_2015 <- dfr_no_outliers %>% filter(`Application Date` >= as.Date("2015-01-01"))
# Create the frequency distribution plot using ggplot
ggplot() +
geom_histogram(data = dfr_before_2015, aes(x = dti, color = "Before 2015", fill = "Before 2015"),
binwidth = 1, alpha = 0.4, position = "identity") +
geom_histogram(data = dfr_after_2015, aes(x = dti, color = "After 2015", fill = "After 2015"),
binwidth = 1, alpha = 0.4, position = "identity") +
labs(title = "Frequency Distribution of Debt-To-Income Ratio (Rejected Loans)",
x = "Debt-To-Income Ratio (%)",
y = "Frequency",
color = "Period",
fill = "Period") +
theme_minimal() +
scale_color_manual(values = c("Before 2015" = "blue", "After 2015" = "red")) +
scale_fill_manual(values = c("Before 2015" = "blue", "After 2015" = "red")) +
theme(legend.position = "top")
# Optional: Clean up memory
rm(dfr1)
rm(dfr_no_outliers)
rm(dfr_before_2015)
rm(dfr_after_2015)
rm(dfr_combined)
Do the loan grades provided to each customer correlate to their loan repayment behavior based on income and loan status?
In this section, we analyze the loan repayment rates categorized by loan grades using a bar plot. By transforming loan status into a binary variable—where “Fully Paid” is coded as 1 and “Charged Off” as 0—we can effectively calculate the repayment rate for each grade.
The bar plot will visually represent the correlation between loan grades and repayment rates, revealing trends that indicate how loan quality influences repayment behavior. This analysis is crucial for understanding the risk associated with different loan grades, highlighting significant disparities in repayment performance. The findings will underscore the importance of grading in assessing loan performance and guiding lending strategies.
# Assuming "Fully Paid" = 1 and "Charged Off" = 0
df_final <- df_final %>%
mutate(loan_status_binary = ifelse(loan_status == "Fully Paid", 1,
ifelse(loan_status == "Charged Off", 0, NA)))
# Calculate repayment rate by grade
repayment_rate <- df_final %>%
group_by(grade) %>%
summarise(repayment_rate = mean(loan_status_binary, na.rm = TRUE)) %>%
mutate(grade = factor(grade, levels = c("A", "B", "C", "D", "E", "F", "G"))) # Ensure correct order
# Create the enhanced bar plot
ggplot(repayment_rate, aes(x = grade, y = repayment_rate, fill = grade)) +
geom_bar(stat = "identity", width = 0.7) +
geom_text(aes(label = scales::percent(repayment_rate, accuracy = 0.1)),
vjust = -0.5, size = 4, fontface = "bold") +
scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
scale_fill_viridis_d() +
labs(title = "Loan Repayment Rate by Grade",
subtitle = "Higher grades show better repayment rates",
x = "Loan Grade",
y = "Repayment Rate") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
plot.subtitle = element_text(hjust = 0.5, size = 12, face = "italic"),
axis.title = element_text(face = "bold"),
axis.text = element_text(size = 10),
legend.position = "none"
)
rm(repayment_rate)
Clear correlation between grade and repayment: There’s a strong positive relationship between loan grade and repayment rate. As the grade improves from G to A, the repayment rate steadily increases.
Significant gap between top and bottom grades: The highest grade (A) has a repayment rate of 94%, while the lowest grade (G) is at 49.2%—a difference of nearly 45 percentage points. This highlights the substantial risk difference between the best and worst-rated loans.
Steeper drops in lower grades: The decrease in repayment rate appears more pronounced between lower grades (e.g., from E to F to G) compared to higher grades. This suggests that risk increases more rapidly as loan quality declines below a certain threshold.
In this section, we analyze the distribution of annual income across different loan grades using a box plot. The box plot provides a visual representation of the income variability within each grade while also highlighting the median income. To ensure accuracy, we employ an outlier removal technique based on the interquartile range (IQR), which helps to focus the analysis on the central tendency of the income distribution. Our findings reveal a clear correlation between loan grade and income, with higher grades consistently associated with greater median incomes. Additionally, we observe that the income disparity narrows among lower grades, indicating that other factors may play a significant role in determining loan grades. This analysis not only emphasizes the importance of income in the lending process but also suggests that additional criteria may influence loan classification, particularly in the lower-grade categories.
# Function to remove outliers using IQR method
remove_outliers <- function(x) {
qnt <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
H <- 1.5 * IQR(x, na.rm = TRUE)
x[x < (qnt[1] - H) | x > (qnt[2] + H)] <- NA
return(x)
}
# Prepare the data and remove outliers
plot_data <- df_final %>%
filter(!is.na(annual_inc) & annual_inc > 0) %>%
mutate(grade = factor(grade, levels = c("A", "B", "C", "D", "E", "F", "G"))) %>%
group_by(grade) %>%
mutate(annual_inc_clean = remove_outliers(annual_inc)) %>%
filter(!is.na(annual_inc_clean)) %>%
ungroup()
# Calculate median income for annotation
median_incomes <- plot_data %>%
group_by(grade) %>%
summarise(median_income = median(annual_inc_clean))
# Create the enhanced boxplot without outliers
income_plot <- ggplot(plot_data, aes(x = grade, y = annual_inc_clean, fill = grade)) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
geom_text(data = median_incomes, aes(y = median_income, label = scales::dollar(median_income)),
vjust = -0.5, size = 3, fontface = "bold") +
scale_y_log10(labels = scales::dollar_format(),
breaks = scales::trans_breaks("log10", function(x) 10^x),
minor_breaks = NULL,
limits = c(1000, NA)) + # Set lower limit to reduce space
scale_fill_viridis_d(option = "D", begin = 0.3, end = 0.9) +
labs(title = "Income Distribution by Loan Grade",
x = "Loan Grade",
y = "Annual Income") +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(hjust = 0.5, size = 22, face = "bold", margin = margin(b = 10)),
axis.title = element_text(face = "bold", size = 16),
axis.text = element_text(size = 12),
legend.position = "none",
panel.grid.major = element_line(color = "gray90"),
panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "white", color = NA),
panel.background = element_rect(fill = "gray98", color = NA),
plot.margin = margin(t = 20, r = 20, b = 20, l = 20)
) +
coord_cartesian(clip = "off") +
annotate("text", x = Inf, y = Inf, label = "BOXPLOT",
hjust = 1.1, vjust = 2, size = 4.5, fontface = "italic", color = "gray30")
# Display the plot (optional)
print(income_plot)
# Save the plot
ggsave("income_distribution_boxplot_no_outliers.png", plot = income_plot, width = 12, height = 8, dpi = 300)
rm(income_plot)
rm(median_incomes)
rm(plot_data)
Higher income for better grades: There’s a clear trend of higher median incomes for better loan grades. Grade A has the highest median income at $74,000, while lower grades (C through G) have lower median incomes around $60,000-$62,000.
Income disparity narrows in lower grades: The income distributions for grades C through G are very similar, with median incomes clustered closely together. This suggests that factors other than income may be more influential in determining these lower grades.
Wider income range for top grades: The boxplots for grades A and B appear to have larger interquartile ranges, indicating more income variability among borrowers in these categories. This could suggest that high-grade loans are accessible to a broader range of income levels, possibly due to other strong credit factors.
# Create income brackets
df_final <- df_final %>%
mutate(income_bracket = cut(annual_inc,
breaks = quantile(annual_inc, probs = seq(0, 1, 0.2), na.rm = TRUE),
labels = c('Very Low', 'Low', 'Medium', 'High', 'Very High'),
include.lowest = TRUE))
# Reorder income_bracket factor levels
df_final$income_bracket <- factor(df_final$income_bracket,
levels = c('Very High', 'High', 'Medium', 'Low', 'Very Low'))
# Calculate repayment rates
repayment_rates <- df_final %>%
group_by(grade, income_bracket) %>%
summarize(repayment_rate = mean(loan_status_binary, na.rm = TRUE)) %>%
pivot_wider(names_from = income_bracket, values_from = repayment_rate)
# Display the repayment rates
print(repayment_rates)
# A tibble: 7 × 6
# Groups: grade [7]
grade `Very High` High Medium Low `Very Low`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0.950 0.945 0.939 0.936 0.914
2 B 0.881 0.873 0.866 0.861 0.846
3 C 0.800 0.784 0.769 0.765 0.755
4 D 0.732 0.701 0.686 0.676 0.676
5 E 0.662 0.622 0.590 0.584 0.599
6 F 0.602 0.544 0.526 0.516 0.522
7 G 0.571 0.499 0.482 0.460 0.461
In this section, we focus on preparing the data for a heatmap analysis that explores the relationship between income brackets and loan repayment rates across different loan grades. To facilitate this analysis, we first categorize annual income into distinct brackets using quantiles, which helps in understanding how income levels relate to repayment behavior. The income brackets include “Very Low,” “Low,” “Medium,” “High,” and “Very High,” and are structured to reflect varying financial capacities of borrowers.
Next, we calculate the repayment rates for each combination of loan grade and income bracket, providing insights into how repayment likelihood varies based on these factors. This transformation of the data into a wide format allows for an effective visualization of repayment rates in a heatmap, highlighting trends and patterns across different income segments and loan grades. The findings from this analysis can inform lending strategies and risk assessment, showcasing the intricate interplay between borrower income and loan performance.
# Prepare data for heatmap
heatmap_data <- repayment_rates %>%
pivot_longer(cols = -grade, names_to = "income_bracket", values_to = "repayment_rate")
# Ensure income_bracket is a factor with correct order
heatmap_data$income_bracket <- factor(heatmap_data$income_bracket,
levels = c('Very High', 'High', 'Medium', 'Low', 'Very Low'))
# Create heatmap
ggplot(heatmap_data, aes(x = income_bracket, y = grade, fill = repayment_rate)) +
geom_tile() +
geom_text(aes(label = scales::percent(repayment_rate, accuracy = 0.1)), color = "black") +
scale_fill_gradientn(colors = c("yellow", "green", "blue"),
labels = scales::percent_format(accuracy = 1),
name = "Repayment Rate") +
labs(title = "Repayment Rate by Grade and Income Bracket",
x = "Income Bracket",
y = "Loan Grade") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
rm(heatmap_data)
rm(repayment_rates)
Grade is the primary determinant of repayment: There’s a clear vertical gradient showing that loan grade has a stronger influence on repayment rates than income. Grade A loans consistently have the highest repayment rates (91-95%) across all income brackets, while Grade G loans have the lowest (46-57%).
Income impact varies by grade: For higher-grade loans (A-C), income has minimal effect on repayment rates. However, for lower-grade loans (D-G), there’s a more noticeable positive correlation between income and repayment rates, particularly for the “Very High” income bracket.
Surprising trend in lowest grades: Counterintuitively, for the lowest grades (F-G), the “Very High” income bracket shows higher repayment rates than some middle income brackets. This suggests that high-income borrowers with poor credit scores might be better risks than their grades suggest.
This section presents the results of an ANOVA test aimed at assessing whether there are significant differences in annual income across various loan grades. The test revealed an F-statistic of 1260.67 and a p-value of effectively zero (printed as 0.0000), indicating highly statistically significant differences in income levels across the grades. These findings confirm the strong relationship between income and loan grades, supporting earlier visual observations that higher grades are associated with higher incomes. Overall, the results suggest that income is a key factor in determining loan grades.
# Perform ANOVA to test for income differences across grades
anova_result <- aov(annual_inc ~ grade, data = df_final)
# Display the ANOVA summary
anova_summary <- summary(anova_result)
# Print the ANOVA results
cat("ANOVA test results for income differences across grades:\n")
ANOVA test results for income differences across grades:
cat(sprintf("F-statistic = %.2f, p-value = %.4f\n",
anova_summary[[1]]$`F value`[1],
anova_summary[[1]]$`Pr(>F)`[1]))
F-statistic = 1260.67, p-value = 0.0000
rm(anova_result)
rm(anova_summary)
F-statistic: 1260.67, p-value: 0.0000
Statistically significant difference: The extremely low p-value (0.0000) indicates there are highly statistically significant differences in income levels across loan grades. This suggests that income is a meaningful factor in determining loan grades.
Strong relationship: The large F-statistic (1260.67) indicates a strong relationship between loan grades and income levels. This suggests that income variations explain a substantial portion of the differences between loan grades.
Confirmation of visual trends: These results statistically confirm the visual trends observed in previous charts, where higher loan grades were associated with higher incomes. The ANOVA provides robust evidence that these income differences across grades are not due to chance.
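A large F-statistic alone does not say what share of income variance the grades explain; eta-squared does. As a sketch on simulated incomes (since `df_final` is not reproduced here), it can be computed directly from the ANOVA sums of squares:

```r
# Simulated incomes with grade-dependent means (illustrative assumption)
set.seed(1)
sim <- data.frame(
  grade = rep(c("A", "B", "C"), each = 100),
  annual_inc = c(rnorm(100, mean = 90000, sd = 10000),
                 rnorm(100, mean = 70000, sd = 10000),
                 rnorm(100, mean = 55000, sd = 10000))
)
fit <- aov(annual_inc ~ grade, data = sim)
ss <- summary(fit)[[1]]$`Sum Sq`
eta_sq <- ss[1] / sum(ss)  # between-group SS over total SS
```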
In this section, we conduct a Chi-Square test to evaluate the relationship between loan grades and loan status outcomes. The results indicate a highly significant association between these two variables, with a p-value of 0.0000. Specifically, the analysis reveals that Grade A loans have a higher-than-expected number of “Fully Paid” outcomes, while lower grades (E, F, and G) exhibit a concerning trend with increased “Charged Off” and “Late” payments. These findings suggest that loan grade is a strong predictor of loan performance, reinforcing the notion that higher grades correspond to lower risk in loan repayments.
# Create a contingency table
contingency_table <- table(df_final$grade, df_final$loan_status)
# Perform chi-square test
chi_square_result <- chisq.test(contingency_table)
# Extract results
chi2 <- chi_square_result$statistic
p_value <- chi_square_result$p.value
dof <- chi_square_result$parameter
# Print results
cat(sprintf("Chi-square test results: chi2 = %.2f, p-value = %.4f\n",
chi2, p_value))
Chi-square test results: chi2 = 123374.94, p-value = 0.0000
cat(sprintf("Degrees of freedom: %d\n", dof))
Degrees of freedom: 36
# Print contingency table
cat("\nContingency table:\n")
Contingency table:
print(contingency_table)
Charged Off Current Default Fully Paid In Grace Period Late (16-30 days)
A 12767 195539 5 200432 632 341
B 48471 262903 7 311162 1771 993
C 81750 255707 8 277973 2993 1476
D 57983 115138 13 129402 1899 969
E 34228 38080 6 53028 799 402
F 13654 8485 1 15896 240 129
G 4342 2465 0 4197 102 39
Late (31-120 days)
A 1322
B 4788
C 7554
D 4761
E 2166
F 640
G 236
# Print expected frequencies
cat("\nExpected frequencies:\n")
Expected frequencies:
print(chi_square_result$expected)
Charged Off Current Default Fully Paid In Grace Period Late (16-30 days)
A 48229 167303 7.619 188974 1606.9 828.4
B 73932 256464 11.680 289686 2463.3 1269.9
C 73623 255392 11.631 288475 2453.0 1264.6
D 36393 126245 5.749 142598 1212.5 625.1
E 15102 52388 2.386 59174 503.2 259.4
F 4581 15892 0.724 17951 152.6 78.7
G 1335 4632 0.211 5232 44.5 22.9
Late (31-120 days)
A 4089
B 6268
C 6242
D 3086
E 1280
F 388
G 113
rm(chi_square_result)
Highly significant relationship: The extremely low p-value (0.0000) indicates a statistically significant association between loan grades and loan outcomes. With a sample this large, however, even modest associations reach significance, so effect size should be assessed alongside the p-value.
Grade A outperforms expectations: Comparing observed to expected frequencies, Grade A loans have far more “Fully Paid” outcomes (200,432 vs 188,974 expected) and fewer “Charged Off” outcomes (12,767 vs 48,229 expected) than expected. This underscores the lower risk associated with top-grade loans.
Lower grades struggle more: Grades E, F, and G show higher frequencies of “Charged Off” and “Late” payments compared to their expected values. This aligns with the higher risk profile of lower-grade loans and supports the trends seen in previous analyses.
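The observed-versus-expected comparison can be made cell by cell with the test's standardized (Pearson) residuals. As a sketch, the 2x2 table below is a toy stand-in for the full grade-by-status contingency table:

```r
# Toy contingency table (assumed counts, for illustration only)
tab <- as.table(rbind(A = c(120, 880),
                      G = c(300, 700)))
dimnames(tab) <- list(grade = c("A", "G"),
                      status = c("Charged Off", "Fully Paid"))

# Pearson residuals: (observed - expected) / sqrt(expected)
res <- chisq.test(tab)$residuals
# Positive residuals mark cells occurring more often than independence predicts
```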
# Interpretation
cat("\nInterpretation:\n")
Interpretation:
if (p_value < 0.05) {
cat("The chi-square test shows a statistically significant association between Grade and Loan Status (p < 0.05).")
cat("\nThis suggests that the loan grade is not independent of the loan status.")
} else {
cat("The chi-square test does not show a statistically significant association between Grade and Loan Status (p >= 0.05).")
cat("\nThis suggests that the loan grade might be independent of the loan status.")
}
The chi-square test shows a statistically significant association between Grade and Loan Status (p < 0.05).
This suggests that the loan grade is not independent of the loan status.
# Effect size (Cramer's V)
n <- sum(contingency_table)
min_dim <- min(dim(contingency_table)) - 1
cramer_v <- sqrt(chi2 / (n * min_dim))
cat(sprintf("\n\nEffect size (Cramer's V): %.4f\n", cramer_v))
Effect size (Cramer's V): 0.0976
cat("Interpretation of Cramer's V:\n")
Interpretation of Cramer's V:
if (cramer_v < 0.1) {
cat("Negligible association")
} else if (cramer_v < 0.3) {
cat("Weak association")
} else if (cramer_v < 0.5) {
cat("Moderate association")
} else {
cat("Strong association")
}
Negligible association
This section presents a stacked bar chart illustrating the distribution of loan statuses by grade. The visualization highlights a clear inverse relationship between loan grade and charge-off rates, with Grade A loans exhibiting the lowest charge-off proportions and Grade G loans the highest. Additionally, higher grades (A and B) show a significantly greater proportion of fully paid loans, emphasizing their lower risk profile. Conversely, as loan grades decrease, the proportion of loans in “Current” status increases, suggesting potential borrower behavior trends and the impact of loan terms on repayment patterns.
#3. Loan Status Distribution by Grade
df_final %>%
ggplot(aes(x = grade, fill = loan_status)) +
geom_bar(position = "fill") +
labs(title = "Loan Status Distribution by Grade",
x = "Loan Grade", y = "Proportion",
fill = "Loan Status") +
theme_minimal() +
scale_y_continuous(labels = scales::percent_format())
rm(df_final)
Inverse relationship between grade and charge-offs: There’s a clear trend of increasing charge-off rates as the loan grade decreases. Grade A loans have the lowest proportion of charge-offs, while Grade G has the highest. This reinforces the higher risk associated with lower-grade loans.
Fully paid loans dominate higher grades: Grades A and B show a significantly higher proportion of fully paid loans compared to lower grades. This indicates that higher-grade loans are more likely to be repaid in full, aligning with their lower risk profile.
Current loans increase in lower grades: The proportion of loans in “Current” status increases noticeably from Grade C to G. This could suggest that lower-grade loans have longer terms or that borrowers of lower-grade loans are more likely to make minimum payments rather than paying off the loan early.
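The proportions drawn by `position = "fill"` can also be tabulated directly, which is useful for citing exact figures. A toy frame stands in for `df_final` in this sketch:

```r
library(dplyr)

# Toy data standing in for df_final
toy <- data.frame(
  grade = c("A", "A", "A", "G", "G"),
  loan_status = c("Fully Paid", "Fully Paid", "Charged Off",
                  "Charged Off", "Fully Paid")
)

# Count statuses within each grade, then normalize to proportions
status_by_grade <- toy %>%
  count(grade, loan_status) %>%
  group_by(grade) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
```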
Is there a relationship between employment length and the amount requested for rejected loans, and how does it compare to the customers whose loans were accepted?
This analysis explores the relationship between employment length and the amount requested for both rejected and accepted loans. By comparing these two groups, we aim to understand how employment stability may influence loan-seeking behavior and the associated loan amounts. This comparison will help identify potential trends and differences that could inform lending practices and risk assessments for various employment profiles.
In this section, we focus on cleaning the data for both accepted and rejected loans to ensure accurate analysis. Specifically, we will standardize the “Employment Length” and “Amount Requested” fields. For the employment length, we convert various formats into a consistent numeric representation, replacing values like “< 1 year” with 0.5 and “10+ years” with 10. We will also remove any unnecessary text for clarity. Additionally, we will clean the loan amount fields to confirm they are numeric, stripping away any non-numeric characters. This cleaning process is essential to facilitate meaningful comparisons between the two loan groups and to analyze their relationship with the requested amounts effectively.
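As a sketch, the same recoding can also be expressed as a single named lookup rather than chained `gsub()` calls, which makes the mapping explicit and sends unmatched values (such as the empty string) to `NA`:

```r
# Explicit mapping from raw employment-length strings to numeric years
emp_map <- c("< 1 year" = 0.5, "1 year" = 1, "2 years" = 2, "3 years" = 3,
             "4 years" = 4, "5 years" = 5, "6 years" = 6, "7 years" = 7,
             "8 years" = 8, "9 years" = 9, "10+ years" = 10)

raw <- c("< 1 year", "5 years", "10+ years", "")
emp_num <- unname(emp_map[raw])  # 0.5  5  10  NA
```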
accepted_loans <- df
rejected_loans <- dfr
rm(df)
rm(dfr)
unique(rejected_loans$"Employment Length")
[1] "< 1 year" "5 years" "10+ years" "" "1 year" "2 years"
[7] "3 years" "6 years" "4 years" "7 years" "8 years" "9 years"
unique(accepted_loans$emp_length)
[1] "10+ years" "3 years" "4 years" "6 years" "1 year" "7 years"
[7] "8 years" "5 years" "2 years" "9 years" "< 1 year" ""
# copy of the "Employment Length" column for rejected loans and emp_length for accepted loans
rejected_loans$emp_length_cleaned <- rejected_loans$"Employment Length"
accepted_loans$emp_length_cleaned <- accepted_loans$emp_length
# Replace "< 1 year" with 0.5, "10+ years" with 10, and strip " year"/" years"
rejected_loans$emp_length_cleaned <- gsub("< 1", "0.5", rejected_loans$emp_length_cleaned)
rejected_loans$emp_length_cleaned <- gsub("10\\+ years", "10", rejected_loans$emp_length_cleaned)
# " years?" also matches the singular " year" (as in "0.5 year" and "1 year"),
# which would otherwise survive and become NA on conversion to numeric
rejected_loans$emp_length_cleaned <- gsub(" years?", "", rejected_loans$emp_length_cleaned)
accepted_loans$emp_length_cleaned <- gsub("< 1", "0.5", accepted_loans$emp_length_cleaned)
accepted_loans$emp_length_cleaned <- gsub("10\\+ years", "10", accepted_loans$emp_length_cleaned)
accepted_loans$emp_length_cleaned <- gsub(" years?", "", accepted_loans$emp_length_cleaned)
# Convert to numeric
rejected_loans$emp_length_cleaned <- as.numeric(rejected_loans$emp_length_cleaned)
accepted_loans$emp_length_cleaned <- as.numeric(accepted_loans$emp_length_cleaned)
# Cleaning loan amount in accepted loans
accepted_loans$loan_amnt_cleaned <- accepted_loans$loan_amnt
# Ensure it is numeric (remove any non-numeric characters just in case)
accepted_loans$loan_amnt_cleaned <- as.numeric(gsub("[^0-9.]", "", accepted_loans$loan_amnt_cleaned))
# Check if any NA values were introduced during the conversion
sum(is.na(accepted_loans$loan_amnt_cleaned))
[1] 0
summary(accepted_loans$loan_amnt_cleaned)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1000 8000 13000 15163 20000 40000
# Cleaning amount.requested in rejected loans
rejected_loans$amount_requested_cleaned <- rejected_loans$"Amount Requested"
# Ensure it is numeric (remove any non-numeric characters)
rejected_loans$amount_requested_cleaned <- as.numeric(gsub("[^0-9.]", "", rejected_loans$amount_requested_cleaned))
This section presents a boxplot analysis comparing employment length with loan amounts for accepted loans.
# Boxplot for accepted loans: Employment Length vs Loan Amount
ggplot(accepted_loans, aes(x = as.factor(emp_length_cleaned), y = loan_amnt_cleaned)) +
geom_boxplot(fill = "lightgreen", color = "blue") +
ggtitle("Accepted Loans: Employment Length vs Loan Amount") +
xlab("Employment Length (years)") +
ylab("Loan Amount") +
theme_minimal()
The median loan amount for accepted loans falls between $10,000 and $20,000, and it changes little across the employment-length categories.
In this section, we focus on the rejected loan dataset, which contains approximately 117,000 observations out of a total of 22 million. This accounts for less than 0.5% of the data pertaining to loan requests, with amounts reaching up to 1 million USD. To ensure a more accurate analysis, we have removed outliers that exhibited excessive deviation from the overall dataset. This cleaning process will help us better understand the relationship between employment length and the amount requested for rejected loans.
##Working on the rejected loan set, cleaning the outliers.
# Calculate Q1, Q3, and IQR for amount requested
Q1 <- quantile(rejected_loans$amount_requested_cleaned, 0.25, na.rm = TRUE)
Q3 <- quantile(rejected_loans$amount_requested_cleaned, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
# Define lower and upper bound
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Filter out outliers
rejected_loans_filtered <- rejected_loans %>%
filter(amount_requested_cleaned >= lower_bound & amount_requested_cleaned <= upper_bound)
The rejected dataset retains about 117,000 observations out of 22 million, less than 0.5% of the data, with requested amounts ranging up to 1 million USD. We removed the outliers that deviated most severely from the rest of the dataset.
This section presents a boxplot analysis of the rejected loans dataset, focusing on the relationship between employment length and the amount requested after removing outliers.
# Boxplot without outliers
ggplot(rejected_loans_filtered, aes(x = as.factor(emp_length_cleaned), y = amount_requested_cleaned)) +
geom_boxplot(fill = "lightcoral", color = "black") +
ggtitle("Rejected Loans: Employment Length vs Amount Requested (Outliers Removed)") +
xlab("Employment Length (years)") +
ylab("Amount Requested") +
theme_minimal()
After outlier removal, the amount requested shows substantial variation within each employment category, with amounts ranging from under $5,000 to around $25,000. To understand how often particular amounts were requested, we can plot the distribution of amounts for both the accepted and rejected datasets and compare how much was requested versus how much was granted.
This section explores the distribution of requested amounts in the rejected loans dataset through a histogram. Out of approximately 22 million observations, around 4 million applicants requested $10,000, making it the most common loan amount sought. This is followed by requests for $5,000, $1,000, $20,000, and $15,000.
We also compare this with the distribution of loan amounts granted in the accepted loans dataset. The accepted-loans histogram indicates that the $10,000 loan package was the most frequently granted amount, followed by $20,000 and $15,000. This analysis helps illustrate the contrast between requested and granted loan amounts across both datasets.
ggplot(rejected_loans_filtered, aes(x = amount_requested_cleaned)) +
geom_histogram(binwidth = 1000, fill = "lightcoral", color = "black") +
ggtitle("Distribution of Amount Requested (Rejected Loans)") +
xlab("Amount Requested") +
ylab("Frequency")
rm(rejected_loans_filtered)
In the rejected loans, out of roughly 22 million observations, about 4 million applicants requested $10,000, making it the most requested amount; $5,000 was the second most requested, followed by $1,000, $20,000, and $15,000.
# Plot the distribution of loan amounts for accepted loans
ggplot(accepted_loans, aes(x = loan_amnt_cleaned)) +
geom_histogram(binwidth = 1000, fill = "lightgreen", color = "black") +
ggtitle("Distribution of Loan Amount (Accepted Loans)") +
xlab("Loan Amount") +
ylab("Frequency") +
theme_minimal()
In the accepted loans, the $10,000 loan package was granted most often, followed by $20,000 and $15,000.
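An overlaid density plot is one way to compare the two distributions above on a single panel. As a sketch, short toy vectors stand in for the cleaned amount columns:

```r
library(ggplot2)

# Toy stand-ins for amount_requested_cleaned and loan_amnt_cleaned
amounts <- data.frame(
  amount = c(10000, 10000, 5000, 20000,     # "requested" stand-ins
             10000, 15000, 20000, 10000),   # "granted" stand-ins
  set = rep(c("Rejected (requested)", "Accepted (granted)"), each = 4)
)

# Semi-transparent densities let the two distributions be read off one panel
p <- ggplot(amounts, aes(x = amount, fill = set)) +
  geom_density(alpha = 0.4) +
  labs(title = "Requested vs Granted Loan Amounts", x = "Amount (USD)") +
  theme_minimal()
```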
This section presents a pie chart visualizing the distribution of employment lengths among applicants whose loan requests were rejected. The analysis focuses on applicants with employment lengths between 1 and 10 years. The chart reveals that 69% of rejected loans were associated with applicants who had 5 years of employment. Additionally, 11.5% of the rejected applications came from individuals with 10 or more years of employment. This visualization highlights the relationship between employment length and loan rejection rates, providing insights into the employment profiles of applicants in the rejected loans dataset.
# Filter employment lengths between 1 and 10 years
emp_length_filtered <- rejected_loans %>%
filter(emp_length_cleaned >= 1 & emp_length_cleaned <= 10)
# Calculate the percentage of each employment length (1 to 10 years)
emp_length_counts <- table(emp_length_filtered$emp_length_cleaned)
emp_length_percentage <- prop.table(emp_length_counts) * 100
rm(emp_length_filtered)
rm(emp_length_counts)
# Convert to a data frame for easy plotting
emp_length_df <- as.data.frame(emp_length_percentage)
rm(emp_length_percentage)
colnames(emp_length_df) <- c("Employment_Length", "Percentage")
# Plot the data using a pie chart
ggplot(emp_length_df, aes(x = "", y = Percentage, fill = factor(Employment_Length))) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
geom_text(aes(label = paste0(round(Percentage, 1), "%")), position = position_stack(vjust = 0.5)) +
ggtitle("Rejected Loans % by Employment Length") +
theme_minimal() +
scale_fill_brewer(palette = "Set3") + # Colorful palette for the pie chart
labs(fill = "Years")
The pie chart shows that 69% of rejected loan applications came from applicants reporting 5 years of employment. Applicants with 10 or more years of employment account for 11.5% of rejections, making them the second-largest employment-length category among rejected requests.
This section presents a pie chart visualizing the distribution of employment lengths among applicants whose loan requests were accepted. The analysis focuses on applicants with employment lengths between 1 and 10 years. Notably, 42.6% of accepted loan applications came from individuals with 10 or more years of employment, indicating a significant correlation between longer employment duration and loan acceptance. In contrast, this category represented just over 11% of the rejected applications. This suggests that loan companies may be more inclined to approve requests from applicants with extensive employment histories, reflecting a potential preference for stability and reliability in the applicant’s profile.
# Filter employment lengths between 1 and 10 years for accepted loans
emp_length_filtered_accepted <- accepted_loans %>%
filter(emp_length_cleaned >= 1 & emp_length_cleaned <= 10)
# Calculate the percentage of each employment length (1 to 10 years)
emp_length_counts_accepted <- table(emp_length_filtered_accepted$emp_length_cleaned)
rm(emp_length_filtered_accepted)
emp_length_percentage_accepted <- prop.table(emp_length_counts_accepted) * 100
rm(emp_length_counts_accepted)
# Convert to a data frame for easy plotting
emp_length_df_accepted <- as.data.frame(emp_length_percentage_accepted)
colnames(emp_length_df_accepted) <- c("Employment_Length", "Percentage")
rm(emp_length_percentage_accepted)
# Plot the data using a pie chart
ggplot(emp_length_df_accepted, aes(x = "", y = Percentage, fill = factor(Employment_Length))) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
geom_text(aes(label = paste0(round(Percentage, 1), "%")), position = position_stack(vjust = 0.5)) +
ggtitle("Accepted Loans % by Employment Length") +
theme_minimal() +
scale_fill_brewer(palette = "Set3") + # Use a colorful palette for better visual appeal
labs(fill = "Years")
rm(emp_length_df_accepted)
By contrast, applicants with 10 or more years of employment account for 42.6% of accepted loan applications, while the same group makes up only slightly more than 11% of rejected applications. This could be interpreted in several ways, but it is consistent with the hypothesis that the lender tends to grant a larger share of requests from applicants with long employment histories.
# Function to clean numeric columns
clean_numeric <- function(data, cols, percent_cols = NULL) {
# Convert specified columns to numeric
for (col in cols) {
data[[col]] <- as.numeric(data[[col]])
}
# If there are percentage columns, clean and convert them as well
if (!is.null(percent_cols)) {
for (col in percent_cols) {
data[[col]] <- as.numeric(gsub("%", "", data[[col]])) / 100
}
}
return(data)
}
# Columns to clean for accepted loans
accepted_columns <- c("emp_length_cleaned", "dti", "loan_amnt")
# Columns to clean for rejected loans (including percentage)
rejected_columns <- c("emp_length_cleaned", "Amount Requested")
percent_columns <- c("Debt-To-Income Ratio")
# Clean the datasets
accepted_loans <- clean_numeric(accepted_loans, accepted_columns)
rejected_loans <- clean_numeric(rejected_loans, rejected_columns, percent_columns)
We have asked how employment length influences an applicant's chance of being granted a loan, and we have analyzed what percentage of loans went to each employment-length group, ranging from less than 1 year to 10 or more years. We can now run statistical tests to assess how confident we can be in these patterns given the size of the dataset.
# 1. ANOVA: Employment Length vs Loan Grade
anova_grade <- aov(emp_length_cleaned ~ grade, data = accepted_loans)
anova_summary <- summary(anova_grade)
# 2. Pearson Correlation: Employment Length vs Annual Income
cor_emp_income <- cor.test(accepted_loans$emp_length_cleaned, accepted_loans$annual_inc, use = "complete.obs")
# 3. Pearson Correlation: Employment Length vs DTI
cor_emp_dti <- cor.test(accepted_loans$emp_length_cleaned, accepted_loans$dti, use = "complete.obs")
# Create a summary table for results
test_results <- data.frame(
Test = c("ANOVA: Employment Length vs Loan Grade",
"Correlation: Employment Length vs Annual Income",
"Correlation: Employment Length vs DTI"),
Statistic = c(anova_summary[[1]][["F value"]][1],
cor_emp_income$estimate,
cor_emp_dti$estimate),
P_Value = c(anova_summary[[1]][["Pr(>F)"]][1],
cor_emp_income$p.value,
cor_emp_dti$p.value)
)
# Print the summary table
print(test_results)
Test Statistic P_Value
1 ANOVA: Employment Length vs Loan Grade 47.2839 2.56e-58
2 Correlation: Employment Length vs Annual Income 0.0349 0.00e+00
3 Correlation: Employment Length vs DTI 0.0243 8.97e-219
The ANOVA test for the relationship between employment length and loan grade gives an F-statistic of 47.28 with a p-value of 2.56e-58, matching the summary table above. This indicates that mean employment length differs significantly across loan grades, consistent with longer employment histories being associated with higher grades. The two correlations (with annual income and with DTI) are statistically significant but very weak (coefficients of 0.03 and 0.02), so employment length explains little of the variation in either variable.
# Load necessary libraries
library(dplyr)
library(ggplot2)
# Data Preparation: Ensure all variables are numeric
rejected_loans$emp_length_cleaned <- as.numeric(rejected_loans$emp_length_cleaned)
rejected_loans$amount_requested_cleaned <- as.numeric(rejected_loans$"Amount Requested")
rejected_loans$Debt.To.Income.Ratio_numeric <- as.numeric(gsub("%", "", rejected_loans$"Debt-To-Income Ratio")) / 100
rejected_loans$Risk_Score <- as.numeric(rejected_loans$Risk_Score)
# Remove rows with missing values
rejected_loans_cleaned <- rejected_loans %>%
filter(!is.na(emp_length_cleaned) &
!is.na(amount_requested_cleaned) &
!is.na(Debt.To.Income.Ratio_numeric) &
!is.na(Risk_Score))
# 1. Pearson Correlation: Employment Length vs Amount Requested
cor_emp_amount <- cor.test(rejected_loans_cleaned$emp_length_cleaned, rejected_loans_cleaned$amount_requested_cleaned, use = "complete.obs")
# 2. Pearson Correlation: Employment Length vs DTI
cor_emp_dti <- cor.test(rejected_loans_cleaned$emp_length_cleaned, rejected_loans_cleaned$Debt.To.Income.Ratio_numeric, use = "complete.obs")
# 3. Linear Regression: Employment Length vs Risk Score
reg_emp_risk <- lm(Risk_Score ~ emp_length_cleaned, data = rejected_loans_cleaned)
summary_reg_emp_risk <- summary(reg_emp_risk)
# Prepare the results for a table output
test_results <- data.frame(
Test = c("Correlation: Employment Length vs Amount Requested",
"Correlation: Employment Length vs DTI",
"Linear Regression: Employment Length vs Risk Score"),
Statistic = c(round(cor_emp_amount$estimate, 2),
round(cor_emp_dti$estimate, 2),
round(summary_reg_emp_risk$coefficients[2, "t value"], 2)),
P_Value = c(round(cor_emp_amount$p.value, 2),
round(cor_emp_dti$p.value, 2),
round(summary_reg_emp_risk$coefficients[2, "Pr(>|t|)"], 2))
)
# Print the results
print(test_results)
Test Statistic P_Value
1 Correlation: Employment Length vs Amount Requested 0.12 0.00
2 Correlation: Employment Length vs DTI 0.00 0.87
3 Linear Regression: Employment Length vs Risk Score 32.85 0.00
Based on the statistical results for the rejected loan dataset, we can say the following:
Correlation: Employment Length vs Amount Requested - Correlation Coefficient: 0.12 - P-Value: 0.00
Explanation: - The correlation coefficient of 0.12 suggests a weak positive correlation between employment length and the amount requested. This means that there is a slight tendency for people with longer employment histories to request higher loan amounts, but the relationship is not strong. - The p-value of 0.00 indicates that this result is statistically significant. Since the p-value is less than 0.05, it means that the observed correlation is unlikely to have occurred by chance. Therefore, even though the correlation is weak, it is a meaningful relationship.
Correlation: Employment Length vs DTI - Correlation Coefficient: 0.00 - P-Value: 0.87
Explanation: - The correlation coefficient of 0.00 suggests that there is no linear relationship between employment length and debt-to-income ratio (DTI). In other words, the length of employment does not seem to have any meaningful impact on the debt-to-income ratio in the rejected loan dataset. - The p-value of 0.87 is much higher than 0.05, meaning the result is not statistically significant. This indicates that any observed relationship between these variables is likely due to chance, and there is no meaningful connection between employment length and DTI in this dataset.
Linear Regression: Employment Length vs Risk Score - T-statistic: 32.85 - P-Value: 0.00
Explanation: - The t-statistic of 32.85 is large, which suggests that employment length has a detectable effect on Risk Score in the regression model. - The p-value of 0.00 indicates that this relationship is statistically significant. Since the p-value is less than 0.05, we can confidently say that there is a meaningful relationship between employment length and risk score. - However, statistical significance does not automatically imply a strong relationship; the actual strength is better judged from the regression coefficients and R-squared, an analysis we have not performed here.
In both the accepted and rejected loan datasets, employment length shows a statistically significant relationship with loan outcomes, though the strength varies. In the accepted dataset, employment length differs significantly across loan grades, indicating that longer employment histories may be associated with higher grades. In the rejected dataset, employment length has a weak but significant positive correlation with the amount requested and a significant relationship with risk score, but no meaningful relationship with debt-to-income ratio. Because we worked with such a large dataset, we can say that applicants with longer employment histories have a higher chance of having their loans accepted. Most applicants with long work histories had good risk scores and higher loan grades, making them preferred candidates from the lending agency's perspective.
Interest rate and term are the strongest predictors of loan charge-offs, while loan amount does not significantly affect loan status. The analysis highlights significant shifts in lending practices and loan applicant profiles before and after 2015. A more conservative lending approach emerged post-2015, marked by stricter criteria for risk scores and a preference for applicants with lower debt-to-income ratios and higher creditworthiness. Loan grades strongly correlate with repayment behavior, with higher grades consistently showing better repayment rates; income levels play a secondary role. The grading system appears effective in predicting loan outcomes. In the accepted dataset, employment length varies significantly by loan grade, suggesting longer employment is linked to higher grades. In the rejected dataset, employment length has a weak but significant correlation with amount requested and a significant relationship with risk score, but no significant relationship with debt-to-income ratio.
Words For The Wise. (n.d.). Lending Club dataset. Kaggle. https://www.kaggle.com/datasets/wordsforthewise/lending-club
Lending Club. (n.d.). Founder savings. Retrieved October 21, 2024, from https://www.lendingclub.com/personal-savings/founder-savings
Dosei, I. (n.d.). Lending Club Loan Data: Data Dictionary. GitHub. https://github.com/dosei1/Lending-Club-Loan-Data/blob/master/LCDataDictionary.csv
Nayak, A. (2020, July 14). Bank loan case study. Medium. https://medium.com/@avi22nayak/bank-loan-case-study-d07524590992