df <- fread("../DataSet/cleaned_accepted_2013_to_2018Q4.csv")
col_names = c("Debt-To-Income Ratio", "Risk_Score","Application Date","Employment Length","Amount Requested")
dfr <- fread("../DataSet/filtered_rejected_2013_to_2018Q.csv", select = col_names)
#dfr <- fetch_subset(df_location = "../DataSet/filtered_rejected_2013_to_2018Q.csv", col_names = col_names, nrows = c(1,-1))
set.seed(1)  # For reproducibility
dfr <- dfr %>% sample_frac(0.20)

1 Overview

Since 2006, LendingClub has been a U.S. financial services company that has facilitated loan contracts. Our research will focus on LendingClub’s loan data that consist of approved and rejected loans between 2013-2018. To understand LendingClub’s clientele, we will analyze borrower and loan characteristics to assess risk levels, determine loan acceptance, and understand what factors influence whether accepted loans will be fully repaid. To accomplish this, we will ask the following SMART questions:

How do the loan features of term, amount, and interest rate individually influence the likelihood of a loan being charged off (lender has accepted the loan as unlikely to recover) versus being fully paid?
How do variables such as dti, annual_inc, int_rate, and fico_range impact the likelihood of a loan being “fully paid” or “charged off” year over year from 2014 to 2016? What trends can be identified in the data, and how do these variables differ between the two loan statuses over time?
Do the loan grades provided to each customer correlate to their loan repayment behavior based on income and loan status?
Is there a relationship between employment length and the amount requested for rejected loans and how does it compare to the customers that had accepted loans?

2 Data Dictionary

Funded Amount: The total amount of the loan to be paid off.
Loan Status: Whether the loan has been fully paid or has been charged off. A loan that is fully paid is defined as a loan where the total payment has been met. A charge off status is defined as the lender accepting the loan as uncollectible. This differs from the default status in which payments are late but the lender expects to collect what the borrower still owes.
Term: How long the borrower has to pay off the loan (Either 36 or 60 Months).
Interest Rate: The percentage of the loan amount that the lender charges the borrower for borrowing money.
DTI (Debt to Income Ratio): A measure used to evaluate a borrower’s financial situation by comparing their debt payments to their gross monthly income.
Annual Income: The dollar amount the customer earns on a yearly basis.
FICO Range: A measure of creditworthiness used by lenders to assess a borrower’s ability to repay a loan.
Loan Grades: Classifies the risk level of a loan from A (less risky) to G (risky).
Employment Length: The amount in years the customer has been employed.

3 About the Datasets

The datasets used in this research are collected from the Lending Club loan data, which is available on Kaggle: Lending Club Loan Data. The data contains information about both accepted and rejected loan applications from 2013 to 2018.

3.1 Accepted Dataset:

print("Accepted Dataset:")

[1] "Accepted Dataset:"

str(df)  # Displaying structure of accepted dataset

Classes 'data.table' and 'data.frame':  2157894 obs. of  110 variables:
 $ id                            : int  68407277 68355089 68341763 66310712 68476807 68426831 68476668 67275481 68466926 68616873 ...
 $ loan_amnt                     : num  3600 24700 20000 35000 10400 ...
 $ funded_amnt                   : num  3600 24700 20000 35000 10400 ...
 $ funded_amnt_inv               : num  3600 24700 20000 35000 10400 ...
 $ term                          : chr  "36 months" "36 months" "60 months" "60 months" ...
 $ int_rate                      : num  14 12 10.8 14.8 22.4 ...
 $ installment                   : num  123 820 433 830 290 ...
 $ grade                         : chr  "C" "C" "B" "C" ...
 $ sub_grade                     : chr  "C4" "C1" "B4" "C5" ...
 $ emp_title                     : chr  "leadman" "Engineer" "truck driver" "Information Systems Officer" ...
 $ emp_length                    : chr  "10+ years" "10+ years" "10+ years" "10+ years" ...
 $ home_ownership                : chr  "MORTGAGE" "MORTGAGE" "MORTGAGE" "MORTGAGE" ...
 $ annual_inc                    : num  55000 65000 63000 110000 104433 ...
 $ verification_status           : chr  "Not Verified" "Not Verified" "Not Verified" "Source Verified" ...
 $ issue_d                       : IDate, format: "2015-12-01" "2015-12-01" ...
 $ loan_status                   : chr  "Fully Paid" "Fully Paid" "Fully Paid" "Current" ...
 $ pymnt_plan                    : chr  "n" "n" "n" "n" ...
 $ url                           : chr  "https://lendingclub.com/browse/loanDetail.action?loan_id=68407277" "https://lendingclub.com/browse/loanDetail.action?loan_id=68355089" "https://lendingclub.com/browse/loanDetail.action?loan_id=68341763" "https://lendingclub.com/browse/loanDetail.action?loan_id=66310712" ...
 $ purpose                       : chr  "debt_consolidation" "small_business" "home_improvement" "debt_consolidation" ...
 $ title                         : chr  "Debt consolidation" "Business" "" "Debt consolidation" ...
 $ zip_code                      : chr  "190xx" "577xx" "605xx" "076xx" ...
 $ addr_state                    : chr  "PA" "SD" "IL" "NJ" ...
 $ dti                           : num  5.91 16.06 10.78 17.06 25.37 ...
 $ delinq_2yrs                   : num  0 1 0 0 1 0 0 1 0 0 ...
 $ earliest_cr_line              : chr  "Aug-2003" "Dec-1999" "Aug-2000" "Sep-2008" ...
 $ fico_range_low                : num  675 715 695 785 695 690 680 705 685 700 ...
 $ fico_range_high               : num  679 719 699 789 699 694 684 709 689 704 ...
 $ inq_last_6mths                : num  1 4 0 0 3 0 0 0 1 0 ...
 $ mths_since_last_delinq        : num  30 6 NA NA 12 NA 49 3 NA 75 ...
 $ open_acc                      : num  7 22 6 13 12 5 12 8 14 8 ...
 $ pub_rec                       : num  0 0 0 0 0 0 0 0 1 0 ...
 $ revol_bal                     : num  2765 21470 7869 7802 21929 ...
 $ revol_util                    : num  29.7 19.2 56.2 11.6 64.5 68.4 84.5 5.7 34.5 39.1 ...
 $ total_acc                     : num  13 38 18 17 35 6 27 15 23 18 ...
 $ initial_list_status           : chr  "w" "w" "w" "w" ...
 $ out_prncp                     : num  0 0 0 15898 0 ...
 $ out_prncp_inv                 : num  0 0 0 15898 0 ...
 $ total_pymnt                   : num  4422 25680 22706 31464 11740 ...
 $ total_pymnt_inv               : num  4422 25680 22706 31464 11740 ...
 $ total_rec_prncp               : num  3600 24700 20000 19102 10400 ...
 $ total_rec_int                 : num  822 980 2706 12362 1340 ...
 $ total_rec_late_fee            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ recoveries                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ collection_recovery_fee       : num  0 0 0 0 0 0 0 0 0 0 ...
 $ last_pymnt_d                  : chr  "Jan-2019" "Jun-2016" "Jun-2017" "Feb-2019" ...
 $ last_pymnt_amnt               : num  123 926 15813 830 10129 ...
 $ next_pymnt_d                  : chr  "" "" "" "Apr-2019" ...
 $ last_credit_pull_d            : chr  "Mar-2019" "Mar-2019" "Mar-2019" "Mar-2019" ...
 $ last_fico_range_high          : num  564 699 704 679 704 759 654 674 719 679 ...
 $ last_fico_range_low           : num  560 695 700 675 700 755 650 670 715 675 ...
 $ collections_12_mths_ex_med    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ policy_code                   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ application_type              : chr  "Individual" "Individual" "Joint App" "Individual" ...
 $ acc_now_delinq                : num  0 0 0 0 0 0 0 0 0 0 ...
 $ tot_coll_amt                  : num  722 0 0 0 0 ...
 $ tot_cur_bal                   : num  144904 204396 189699 301500 331730 ...
 $ open_acc_6m                   : num  2 1 0 1 1 0 0 0 2 0 ...
 $ open_act_il                   : num  2 1 1 1 3 1 2 3 1 2 ...
 $ open_il_12m                   : num  0 0 0 0 0 0 0 0 0 2 ...
 $ open_il_24m                   : num  1 1 4 1 3 0 2 4 0 3 ...
 $ mths_since_rcnt_il            : num  21 19 19 23 14 338 18 13 35 10 ...
 $ total_bal_il                  : num  4981 18005 10827 12609 73839 ...
 $ il_util                       : num  36 73 73 70 84 99 63 75 57 72 ...
 $ open_rv_12m                   : num  3 2 0 1 4 0 2 0 2 0 ...
 $ open_rv_24m                   : num  3 3 2 1 7 0 3 0 7 2 ...
 $ max_bal_bc                    : num  722 6472 2081 6987 9702 ...
 $ all_util                      : num  34 29 65 45 78 76 74 55 46 49 ...
 $ total_rev_hi_lim              : num  9300 111800 14000 67300 34000 ...
 $ inq_fi                        : num  3 0 2 0 2 0 1 1 2 0 ...
 $ total_cu_tl                   : num  1 0 5 1 1 0 0 0 0 0 ...
 $ inq_last_12m                  : num  4 6 1 0 3 0 1 2 1 1 ...
 $ acc_open_past_24mths          : num  4 4 6 2 10 0 6 4 7 5 ...
 $ avg_cur_bal                   : num  20701 9733 31617 23192 27644 ...
 $ bc_open_to_buy                : num  1506 57830 2737 54962 4567 ...
 $ bc_util                       : num  37.2 27.1 55.9 12.1 77.5 ...
 $ chargeoff_within_12_mths      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ delinq_amnt                   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ mo_sin_old_il_acct            : num  148 113 125 36 128 338 142 149 164 155 ...
 $ mo_sin_old_rev_tl_op          : num  128 192 184 87 210 54 306 55 129 253 ...
 $ mo_sin_rcnt_rev_tl_op         : num  3 2 14 2 4 32 10 32 1 15 ...
 $ mo_sin_rcnt_tl                : num  3 2 14 2 4 32 10 13 1 10 ...
 $ mort_acc                      : num  1 4 5 1 6 0 4 3 1 1 ...
 $ mths_since_recent_bc          : num  4 2 101 2 4 36 12 32 4 50 ...
 $ mths_since_recent_inq         : num  4 0 10 NA 1 NA 10 8 1 10 ...
 $ mths_since_recent_revol_delinq: num  69 6 NA NA 12 NA NA NA NA NA ...
 $ num_accts_ever_120_pd         : num  2 0 0 0 0 0 0 1 0 1 ...
 $ num_actv_bc_tl                : num  2 5 2 4 4 2 4 2 6 3 ...
 $ num_actv_rev_tl               : num  4 5 3 5 6 3 6 2 9 3 ...
 $ num_bc_sats                   : num  2 13 2 8 5 2 4 3 7 3 ...
 $ num_bc_tl                     : num  5 17 4 10 9 2 5 3 10 6 ...
 $ num_il_tl                     : num  3 6 6 2 10 2 7 9 3 5 ...
 $ num_op_rev_tl                 : num  4 20 4 10 7 4 9 3 13 5 ...
 $ num_rev_accts                 : num  9 27 7 13 19 4 16 3 19 11 ...
 $ num_rev_tl_bal_gt_0           : num  4 5 3 5 6 3 6 2 9 3 ...
 $ num_sats                      : num  7 22 6 13 12 5 12 8 14 8 ...
 $ num_tl_120dpd_2m              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ num_tl_30dpd                  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ num_tl_90g_dpd_24m            : num  0 0 0 0 0 0 0 1 0 0 ...
 $ num_tl_op_past_12m            : num  3 2 0 1 4 0 2 0 2 2 ...
  [list output truncated]
 - attr(*, ".internal.selfref")=<externalptr>

Count: The dataset contains approximately 2157894 approved loans.
Structure: The dataset has 110 variables capturing various aspects of each loan and borrower profile, such as credit score, loan amount, employment length, and loan status.

3.2 Rejected Dataset:

The Rejected Dataset comprises loan applications that were not approved by Lending Club. This dataset includes important information on why certain applicants were denied, based on factors like their debt-to-income ratio and risk score.

print("Rejected Dataset:")

[1] "Rejected Dataset:"

str(dfr)  # Displaying structure of rejected dataset

Classes 'data.table' and 'data.frame':  5378466 obs. of  5 variables:
 $ Debt-To-Income Ratio: chr  "3.52%" "36.2%" "14.15%" "22.1%" ...
 $ Risk_Score          : num  NA NA 653 NA NA NA NA 713 NA NA ...
 $ Application Date    : IDate, format: "2017-07-08" "2018-05-14" ...
 $ Employment Length   : chr  "< 1 year" "< 1 year" "< 1 year" "< 1 year" ...
 $ Amount Requested    : num  1000 10000 20000 30000 1000 1000 35000 19000 10000 12000 ...
 - attr(*, ".internal.selfref")=<externalptr>

Count: The rejected loan dataset contains around 5378466 rejected applications.
Structure: This dataset includes 5 variables, focusing on key aspects like the applicant’s risk score, debt-to-income ratio, and application date.

These datasets are pivotal for analyzing the factors contributing to the approval or rejection of loan applications, helping us understand credit risk management and lending patterns during the specified time period.

4 Research Question 1

How do the loan features of term, amount, and interest rate individually influence the likelihood of a loan being charged off versus being fully paid?

This question focuses on assessing risk levels and looking into trends of when the company can expect to see a loan charged off (lender has accepted the loan as unlikely to recover) or being paid off.

Assumptions: In terms of loan status, we are only analyzing the “Fully Paid” and “Charge Off” status.

4.1 Data Types & Data Cleaning

For this question, we will be looking at funded amount, loan status, term, and interest rate.

Based on this data, the funded amount and interest rate are data types that are numerical. On the other hand, loan status and term have been converted to factors and are categorical data types. While loan status logically makes sense to be categorical, term has been converted to a factor since there were only two types of loan terms (36 or 60 months) which can be seen as short term vs. long term when comparing the likelihood of default across loans. Term in this case will be viewed as nonlinear with interest rates and funded amount affecting risk with linearity.

To retrieve this information, data cleaning was done on the accepted loans dataset that initially contained 2,157,894 million observations. After filtering, the total observations stand at 1,245,285 observations.

#Subset for Specific Columns
f_df<- df[, c("funded_amnt", "loan_status","term","int_rate")]

#Filter for specific loan status
df_filter <- f_df %>%
  filter(loan_status %in% c("Charged Off", "Fully Paid"))

rm(f_df)

#Convert loan status and term as factors
df_filter$loan_status <- as.factor(df_filter$loan_status)
df_filter$term <- as.factor(df_filter$term)


#Select Relevant Columns
df_final <-df_filter %>%
  select(funded_amnt, term, int_rate, loan_status)

rm(df_filter)
#Check the Dataframe output
str(df_final)

Classes 'data.table' and 'data.frame':  1245285 obs. of  4 variables:
 $ funded_amnt: num  3600 24700 20000 10400 11950 ...
 $ term       : Factor w/ 2 levels "36 months","60 months": 1 1 2 2 1 1 1 1 1 1 ...
 $ int_rate   : num  14 12 10.8 22.4 13.4 ...
 $ loan_status: Factor w/ 2 levels "Charged Off",..: 2 2 2 2 2 2 2 2 2 2 ...
 - attr(*, ".internal.selfref")=<externalptr>

4.2 Normality of Loan Amount & Interest Rate

4.2.1 QQ Plot of Loan Amount

amount_plot = outlierKD2(df_final, funded_amnt, rm = TRUE, qqplt = TRUE, histogram = FALSE)

Outliers identified: 7157 
Proportion (%) of outliers: 0.6 
Mean of the outliers: 39937 
Mean without removing outliers: 14558 
Mean if we remove outliers: 14411 
Outliers successfully removed

rm(amount_plot)

The loan amount data does not appear to follow a normal distribution based on the QQ plot, as there are significant deviations from the reference line, particularly at the tails. The lower quantiles fall below the line, while the upper quantiles rise far above it, indicating the presence of heavy tails. Additionally, the stepped line in the middle suggests that the data isn’t evenly distributed. Even with the outliers removed, there are still extreme deviations at both ends therefore showing that this is not a normal distribution.

4.2.2 QQ Plot of Interest Rate

rate_plot = outlierKD2(df_final, int_rate, rm = TRUE, qqplt = TRUE, histogram = FALSE)

Outliers identified: 24826 
Proportion (%) of outliers: 2 
Mean of the outliers: 27.6 
Mean without removing outliers: 13.3 
Mean if we remove outliers: 13 
Outliers successfully removed

rm(rate_plot)

The interest rate also shows similar trends. There are major deviations from the reference line, especially in the upper tail, indicating that the interest rate distribution is not normal when outliers are present. Even with the outliers removed, the QQ plot appears much more linear showing significant improvement towards normality, though there are deviations that remain in the tails.

Overall, both loan amount and interest rate distributions are not normal when outliers are included. While showing improvement in normality when removing outliers, deviations still occur at the extremes. As a result, these datasets don’t follow normality due to these deviations.

4.3 Descriptive Statistics

Initial statistics summarizes data that can help in identifying patterns, trends, and outliers. To demonstrate this, loan status will be analyzed against each loan feature (term, amount and interest rate) to see any relationships and patterns.

4.3.1 Relationship between Loan Status and Loan Term

# Contingency Table (Correlation Analysis)
c_table <- table(df_final$term, df_final$loan_status)
cat("Contingency Table: Loan Status by Term\n\n")

Contingency Table: Loan Status by Term

print(c_table)

           
            Charged Off Fully Paid
  36 months      153426     789214
  60 months       99769     202876

#Contingency Table by Proportions
prop_table <-prop.table(c_table) *100
cat("Contingency Table: Loan Status by Term Proportions (by %)\n\n")

Contingency Table: Loan Status by Term Proportions (by %)

print(prop_table)

           
            Charged Off Fully Paid
  36 months       12.32      63.38
  60 months        8.01      16.29

rm(c_table)
# Stacked Bar Chart
ggplot(df_final, aes(x = term, fill = loan_status)) + 
  geom_bar(position = "stack") +
  labs(x="Loan Term", y="Number of Loans",title="Stacked Bar Chart of Loan Status by Term" ) +
  scale_fill_manual(values = c("navy", "lightblue"))

Via the first contingency table, 36 month loans have a higher absolute number of both fully paid and charged off loans compared to 60 month loans. Proportionally, 36 month loans have a higher rate of being charged off (12.32%) than 60 month loans (8.01%).

When visually shown on a stacked bar chart, you’ll notice that shorter terms are more likelier to get charged off than loans with a longer term.

Overall, shorter loan terms are likelier to result in a charge off and appear to influence loan status.

4.3.2 Relationship between Loan Status and Loan Amount

# Descriptive Stats of Loan Amount Separated by Loan Status

amnt_stats <- df_final %>%
  group_by(loan_status) %>%
  summarise(
    Mean = mean(funded_amnt, na.rm = TRUE),
    Median = median(funded_amnt, na.rm = TRUE),
    SD = sd(funded_amnt, na.rm = TRUE),
    Q25 = quantile(funded_amnt, 0.25, na.rm = TRUE),
    Q50 = quantile(funded_amnt, 0.50, na.rm = TRUE),
    Q75 = quantile(funded_amnt, 0.75, na.rm = TRUE),
    Min = min(funded_amnt, na.rm = TRUE),
    Max = max(funded_amnt, na.rm = TRUE)
  )

cat("Descriptive Stats of Loan Amounts Separated by Loan Status)\n\n")

Descriptive Stats of Loan Amounts Separated by Loan Status)

print(amnt_stats)

# A tibble: 2 × 9
  loan_status   Mean Median    SD   Q25   Q50   Q75   Min   Max
  <fct>        <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Charged Off 15666.  14400 8828.  9000 14400 20750  1000 40000
2 Fully Paid  14275.  12000 8722.  7500 12000 20000  1000 40000

rm(amnt_stats)

# Box Plot Comparing Loan Amounts by Loan Status
ggplot(df_final, aes(x = loan_status, y=funded_amnt, fill = loan_status)) + 
  geom_boxplot() + 
  geom_boxplot( colour="black", outlier.colour="blue", outlier.shape=8, outlier.size=4) +
  labs(title="Boxplot of Loan Amounts by Loan Status", x="Loan Status", y = "Loan Amount ($)")+ 
  scale_fill_manual(values = c("navy", "lightblue"))

Based on the statistics, the charged off loans having a higher mean and median loan amount, this can suggest that larger loan amounts may be associated with a higher risk of being charged off. Charged off loans also have a slightly higher standard deviation showing for more variability in the amounts.Lastly, both loan statuses show the same minimum and maximum but that does not necessarily mean that the distributions are identical.

Based on the boxplot, one commonality is the several outliers that exist around $40,000 for both loan statuses. However, this is not indicative that it is caused by status and requires further investigation through these high amounts being affected by other factors. Given the IQR of both statuses, there is not significant variability. You can however tell based on the IQR that the charged off loans have a slightly higher loan amounts compared to the fully paid loans that have a slightly lower loan amount.

Overall, larger loan amounts may be associated with a higher charged off risk. However holistically, loan amount alone does not appear to have a strong influence on whether a loan will be charged off or fully paid, given overlapping distributions.

4.3.3 Relationship Between Loan Status and Interest Rate

# Descriptive Stats of Interest Rate Separated by Loan Status

rate_stats <- df_final %>%
  group_by(loan_status) %>%
  summarise(
    Mean = mean(int_rate, na.rm = TRUE),
    Median = median(int_rate, na.rm = TRUE),
    SD = sd(int_rate, na.rm = TRUE),
    Q25 = quantile(int_rate, 0.25, na.rm = TRUE),
    Q50 = quantile(int_rate, 0.50, na.rm = TRUE),
    Q75 = quantile(int_rate, 0.75, na.rm = TRUE),
    Min = min(int_rate, na.rm = TRUE),
    Max = max(int_rate, na.rm = TRUE)
  )

cat("Descriptive Stats of Interest Rates Separated by Loan Status)\n\n")

Descriptive Stats of Interest Rates Separated by Loan Status)

print(rate_stats)

# A tibble: 2 × 9
  loan_status  Mean Median    SD   Q25   Q50   Q75   Min   Max
  <fct>       <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Charged Off  15.8   15.0  4.95 12.3   15.0  18.6  5.31  31.0
2 Fully Paid   12.6   12.1  4.56  9.17  12.1  15.2  5.31  31.0

rm(rate_stats)

df_count <-df_final %>%
  filter(int_rate >0) %>%
  mutate(int_rate_bin = floor(int_rate/2)*2) %>%
  filter(int_rate_bin >0) %>%
  count(int_rate_bin, loan_status)

ggplot(df_count, aes(x = int_rate_bin, y = n, color = loan_status)) + 
  geom_line(size = 1)+ 
  labs(x="Interest Rate (%)", y="Count", title = "Distribution of Interest Rates by Loan Status") +
  scale_color_manual(values = c("navy", "lightblue"))

rm(df_count)
rm(df_final)

Based on the statistics, higher interest rates significantly increase the likelihood of a loan being charged off. Charged off loans have a mean interest rate of 15.8%, compared to 12.6% for fully paid loans, with a similar difference in the median (15.1% vs. 12.1%). Even at the lower end of the interest rate range, loans that default tend to have higher rates. As charged off loans have a higher standard deviation and higher 75th percentile, this suggests greater volatility and risk. This indicates that interest rates are a strong predictor of loan defaults.

Similar to the statistics, the histogram shows interest rates having a strong influence on loan outcomes. Loans with a higher interest rate of around 15% and above are more likely to be charged off (in proportion to fully paid), indicating a greater risk of default at these levels. In contrast, lower interest rates between 10% to 12% are associated with a higher frequency of fully paid loans, suggesting that borrowers with lower rates are more likely to repay their loans in full. There is also a noticeable overlap in the 10% to 15% interest rate range, where fully paid and charged off loans occur. But as interest rates increase, the likelihood of a loan being charged off rises significantly. This suggests that interest rates are a key predictor of loan performance, with higher rates correlating strongly with loan defaults and lower rates with successful repayment.

Overall, loans with higher interest rates are likelier to be charged off, making interest rates a strong influence of loan charge offs.

columns_to_target = c("annual_inc", "dti", "fico_range_high", "fico_range_low", "loan_amnt", 
  "int_rate", "earliest_cr_line", "revol_util", "delinq_2yrs", "pub_rec", 
  "total_acc", "open_acc", "installment", "home_ownership", 
  "verification_status", "delinq_amnt", "collections_12_mths_ex_med", 
  "chargeoff_within_12_mths", "mths_since_last_delinq", "purpose", "sub_grade", "issue_d", "addr_state","inq_last_12m","loan_status","chargeoff_within_12_mths")


#df <- fetch_subset(df_location = df_file_path, col_names = columns_to_target, nrows = c(1,-1))
df_final <- df %>% select(all_of(columns_to_target))


convert_to_factors <- function(df) {
  # Columns to convert to factors based on the provided list
  factor_cols <- c("home_ownership", "verification_status", 
                   "loan_status", "purpose", "addr_state","sub_grade")
  
  df[] <- lapply(names(df), function(col) {
    if (col %in% factor_cols) {
      # Check if the column contains numeric-like values as strings and standardize them
      if (col == "chargeoff_within_12_mths") {
        # Convert '0.0' and '0' to '0', and handle similar cases
        df[[col]] <- as.character(df[[col]])
        df[[col]][df[[col]] == "0.0"] <- "0"
        df[[col]][df[[col]] == "1.0"] <- "1"  # If applicable, handle '1.0' similar to '1'
        return(as.factor(df[[col]]))  # Convert the cleaned column to factor
      } else {
        return(as.factor(df[[col]]))  # Convert other factor columns directly
      }
    } else if (is.character(df[[col]]) && all(!is.na(as.numeric(df[[col]][-1])))) {
      return(as.numeric(df[[col]]))  # Convert numeric-like strings to numeric
    } else {
      return(df[[col]])  # Leave other columns unchanged
    }
  })
  
  return(df)
}
df_final <- df_final[-1,]
df_final <- convert_to_factors(df_final)
#str(df_final)

# Convert specified columns to numeric

df_final$revol_util <- as.numeric(df_final$revol_util)
df_final$mths_since_last_delinq <- as.numeric(df_final$mths_since_last_delinq)
df_final$dti <- as.numeric(df_final$dti)
df_final$inq_last_12m <- as.numeric(df_final$inq_last_12m)
df_final$chargeoff_within_12_mths <- as.numeric(df_final$chargeoff_within_12_mths)

# Optional: Check for any warnings about NAs being introduced during coercion
if (any(is.na(df_final$revol_util))) {
  warning("NAs introduced in 'revol_util' during conversion.")
}
if (any(is.na(df_final$mths_since_last_delinq))) {
  warning("NAs introduced in 'mths_since_last_delinq' during conversion.")
}
if (any(is.na(df_final$dti))) {
  warning("NAs introduced in 'dti' during conversion.")
}


df_final$issue_d <- as.Date(df_final$issue_d, format = "%Y-%m-%d")

5 Research Question 2

How do variables such as dti, annual_inc, int_rate, and fico_range impact the likelihood of a loan being fully paid or charged off year over year from 2013 to 2018? What trends can be identified in the data, and how do these variables differ between the two loan statuses over time?

In this analysis, we aim to explore how key financial variables such as debt-to-income ratio (DTI), annual income, interest rate, and FICO score range affect the likelihood of a loan being fully paid or charged off between 2013 and 2018. By analyzing these variables over time, we can identify trends and patterns that may influence loan outcomes. Specifically, we seek to understand how these factors differ between loans that were successfully repaid versus those that were charged off, shedding light on risk factors and lending dynamics during this period.

5.1 Analysis of Charged-Off Loan Percentage (2013-2018)

In this section, we analyze the percentage of loans that were charged off (i.e., loans that borrowers were unable to fully repay) between 2013 and 2018. By aggregating loan data on a quarterly basis, we examine the trends in charged-off loans over time to gain insights into changes in lending risk and borrower behavior. Specifically, we explore how the charged-off loan percentage fluctuated over the years and consider potential factors that could explain these changes.

# LINE PLOT

# Create a copy of the dataframe
df1 <- df_final

# Step 1: Ensure 'issue_d' is in Date format
df1$issue_d <- as.Date(df1$issue_d, format = "%Y-%m-%d")  # Adjust the format if necessary

# Step 2: Extract year and quarter from 'issue_d'
df1 <- df1 %>%
  mutate(year = year(issue_d), 
         quarter = quarter(issue_d))

# Step 3: Filter out rows with NA values in year or quarter
df1 <- df1 %>%
  filter(!is.na(year), !is.na(quarter))

# Step 4: Calculate the total number of loans per quarter
df_total_per_quarter <- df1 %>%
  group_by(year, quarter) %>%
  summarise(total_loans = n(), .groups = 'drop')  # Adding .groups = 'drop' to avoid warnings

# Step 5: Filter for 'Charged Off' loan status
df_charged_off <- df1 %>%
  filter(loan_status == "Charged Off")

# Step 6: Aggregate 'Charged Off' counts by year and quarter
df_charged_off_aggregated <- df_charged_off %>%
  group_by(year, quarter) %>%
  summarise(charged_off_count = n(), .groups = 'drop')  # Adding .groups = 'drop' to avoid warnings

# Step 7: Merge the total loan counts with the 'Charged Off' counts
df_merged <- df_charged_off_aggregated %>%
  left_join(df_total_per_quarter, by = c("year", "quarter")) %>%
  mutate(percentage = (charged_off_count / total_loans) * 100)

# Step 8: Create a Year-Quarter variable in the correct format and order
df_merged <- df_merged %>%
  mutate(Year_Quarter = paste0(year, ".", quarter)) %>%
  arrange(year, quarter)  # Arrange the data to ensure it's in the correct order

# Step 9: Convert Year_Quarter into a factor with levels in the correct chronological order
df_merged$Year_Quarter <- factor(df_merged$Year_Quarter, 
                                 levels = df_merged$Year_Quarter)

# Step 10: Create the line plot with percentage and display percentage as text on each point
ggplot(df_merged, aes(x = Year_Quarter, y = percentage, group = 1)) +
  geom_area(fill = "red", alpha = 0.3) +  # Fill the area under the line with color and adjust opacity
  geom_line(color = "red", size = 1) +  # Line color and size
  geom_point(color = "red", size = 2) +  # Point color and size
  geom_text(aes(label = sprintf("%.1f%%", percentage)), 
            vjust = -0.5, size = 3, nudge_y = 0.5) +  # Display percentage above points
  labs(x = "Year-Quarter", y = "Charged Off Percentage", title = "Percentage of Charged Off Loans by Year and Quarter") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels
        panel.grid.major.y = element_blank(),  # Remove horizontal grid lines
        panel.grid.minor.y = element_blank())  # Remove minor horizontal grid lines

rm(df_charged_off)
rm(df1)
rm(df_charged_off_aggregated)
rm(df_merged)
rm(df_total_per_quarter)

The charged-off percentage increases from Q1 2013 (14.9%) to Q2 2015 (18.5%), then decreases drastically until Q4 2018. There could be two explanations for this decline:

Lenders may have changed their loan selection parameters.
Since most loan tenures exceed three years, some active loans might charge off in the future (after 2018), for which we don’t have data.

5.2 Hypothesis Testing for Key Financial Features

In this section, we conduct hypothesis testing on key financial features to examine their influence on loan outcomes, specifically the likelihood of a loan being charged off. Using Welch’s t-test, we compare loans issued before and after Q2 2015 to identify significant differences in features such as annual income, debt-to-income ratio, interest rate, FICO score, and other variables that may impact loan status.

For each feature, the null hypothesis (H0) and alternative hypothesis (H1) are as follows:

Null Hypothesis (H0): There is no significant difference in the mean of the feature between loans issued before and after Q2 2015.
Alternative Hypothesis (H1): There is a significant difference in the mean of the feature between loans issued before and after Q2 2015.

Based on the p-values obtained from the t-tests, we determine whether to reject the null hypothesis in favor of the alternative hypothesis. The results help us identify which features play a significant role in determining loan outcomes over time.

# Step 1: Verify and convert 'issue_d' to Date if necessary
df_final$issue_d <- as.Date(df_final$issue_d, format = "%Y-%m-%d")  # Adjust the format if necessary

# Step 2: Filter the dataframe for 'Charged Off' loan_status and remove NaN values
charged_off_df <- df_final %>%
  filter(loan_status == "Charged Off") %>%
  filter(!is.na(issue_d))  # Ensure we have dates to work with

# Step 3: Create a new column for quarters if not already created
charged_off_df <- charged_off_df %>%
  mutate(quarter = quarter(issue_d), 
         year = year(issue_d),
         quarter_year = as.integer(format(issue_d, "%Y")) * 10 + ceiling(as.integer(format(issue_d, "%m")) / 3))

# Step 4: Check unique quarters available in the dataset
unique_quarters <- unique(charged_off_df$quarter_year)
#cat("Unique quarters in the dataset:", unique_quarters, "\n")

# Step 5: Split the data into two groups: before and including Q2 2015, and after Q2 2015
before_2015_Q2 <- charged_off_df %>%
  filter(year < 2015 | (year == 2015 & quarter <= 2))

after_2015_Q2 <- charged_off_df %>%
  filter(year > 2015 | (year == 2015 & quarter > 2))



# Proceed only if there are records in both groups
if (nrow(before_2015_Q2) > 0 && nrow(after_2015_Q2) > 0) {
  # Step 6: Identify numerical features in the filtered dataframe
  numerical_cols <- before_2015_Q2 %>%
    select_if(is.numeric) %>%
    colnames()

  # Step 7: Perform T-test for each numeric column, excluding those with zero observations in either group
  alpha <- 0.05  # Significance level
  t_test_results <- lapply(numerical_cols, function(col) {
    # Extract the two groups
    group1 <- before_2015_Q2[[col]]
    group2 <- after_2015_Q2[[col]]

    # Check if both groups have enough non-NA observations
    if (length(na.omit(group1)) > 0 && length(na.omit(group2)) > 0) {
      # Perform the t-test
      t_test_result <- t.test(group1, group2, var.equal = FALSE)  # Welch's t-test

      # Determine hypothesis status
      hypothesis_status <- ifelse(t_test_result$p.value <= alpha, "Reject H0", "Fail to Reject H0")

      return(data.frame(Feature = col, T_value = t_test_result$statistic, P_value = t_test_result$p.value, Hypothesis_Status = hypothesis_status))
    } else {
      return(NULL)  # Return NULL for insufficient data
    }
  })

  # Step 8: Remove NULL results and combine results into a dataframe
  t_test_results_df <- do.call(rbind, Filter(Negate(is.null), t_test_results))
  
  # Step 9: Remove unwanted features
  t_test_results_filtered <- t_test_results_df %>%
    filter(!Feature %in% c("quarter", "year", "quarter_year"))
  
  # Step 10: Print the filtered results
  print(t_test_results_filtered)

} else {
  cat("Not enough records in one of the groups to perform t-tests.\n")
}

                       Feature T_value   P_value Hypothesis_Status
t                   annual_inc  -14.29  2.66e-46         Reject H0
t1                         dti  -20.26  3.66e-91         Reject H0
t2             fico_range_high  -21.90 3.33e-106         Reject H0
t3              fico_range_low  -21.90 3.32e-106         Reject H0
t4                   loan_amnt   -1.84  6.65e-02 Fail to Reject H0
t5                    int_rate    7.60  2.99e-14         Reject H0
t6                  revol_util   65.33  0.00e+00         Reject H0
t7                 delinq_2yrs   -6.79  1.11e-11         Reject H0
t8                     pub_rec  -18.93  7.76e-80         Reject H0
t9                   total_acc   21.91 2.49e-106         Reject H0
t10                   open_acc   -6.28  3.30e-10         Reject H0
t11                installment  -16.88  6.75e-64         Reject H0
t12                delinq_amnt   -4.22  2.42e-05         Reject H0
t13 collections_12_mths_ex_med  -15.26  1.42e-52         Reject H0
t14   chargeoff_within_12_mths   -2.94  3.32e-03         Reject H0
t15     mths_since_last_delinq    1.18  2.39e-01 Fail to Reject H0

rm(after_2015_Q2)
rm(before_2015_Q2)
rm(charged_off_df)
rm(t_test_results)
rm(t_test_results_df)
rm(t_test_results_filtered)

Annual Income (annual_inc):
- T-value = -14.29, P-value = 2.66e-46
- Result: Reject H0
- Interpretation: Significant difference in annual income, indicating that it is a key feature for loan outcomes.
Debt-to-Income Ratio (dti):
- T-value = -20.26, P-value = 3.66e-91
- Result: Reject H0
- Interpretation: Significant difference in debt-to-income ratio, suggesting its importance in distinguishing loan status.
FICO Range High (fico_range_high) and FICO Range Low (fico_range_low):
- T-value = -21.90, P-value = 3.33e-106 (high), 3.32e-106 (low)
- Result: Reject H0
- Interpretation: Both FICO score ranges are critical indicators of loan status.
Loan Amount (loan_amnt):
- T-value = -1.84, P-value = 6.65e-02
- Result: Fail to Reject H0
- Interpretation: No significant difference in loan amount with respect to loan outcomes.
Interest Rate (int_rate):
- T-value = 7.60, P-value = 2.99e-14
- Result: Reject H0
- Interpretation: Interest rate is a significant differentiator in loan status.
Revolving Utilization (revol_util):
- T-value = 65.33, P-value = 0.00e+00
- Result: Reject H0
- Interpretation: Strong influence of revolving utilization on loan status.
Delinquencies in 2 Years (delinq_2yrs):
- T-value = -6.79, P-value = 1.11e-11
- Result: Reject H0
- Interpretation: Significant impact of past delinquencies on loan status.
Public Records (pub_rec):
- T-value = -18.93, P-value = 7.76e-80
- Result: Reject H0
- Interpretation: Public records significantly affect loan outcomes.
Total Accounts (total_acc):
- T-value = 21.91, P-value = 2.49e-106
- Result: Reject H0
- Interpretation: The total number of accounts plays a critical role in determining loan status.
Open Accounts (open_acc):
- T-value = -6.28, P-value = 3.30e-10
- Result: Reject H0
- Interpretation: Number of open accounts significantly influences loan status.
Installment Amount (installment):
- T-value = -16.88, P-value = 6.75e-64
- Result: Reject H0
- Interpretation: Installment amount is a significant factor in loan status differentiation.
Delinquency Amount (delinq_amnt):
- T-value = -4.22, P-value = 2.42e-05
- Result: Reject H0
- Interpretation: Past delinquency amounts significantly impact loan status.
Collections in 12 Months (collections_12_mths_ex_med):
- T-value = -15.26, P-value = 1.42e-52
- Result: Reject H0
- Interpretation: The number of collections in 12 months is a key feature influencing loan status.
Charge-off within 12 Months (chargeoff_within_12_mths):
- T-value = -2.94, P-value = 3.32e-03
- Result: Reject H0
- Interpretation: Significant difference in charge-off occurrences within 12 months.
Months Since Last Delinquency (mths_since_last_delinq):
- T-value = 1.18, P-value = 2.39e-01
- Result: Fail to Reject H0
- Interpretation: No significant difference in months since the last delinquency.

5.3 Correlation Plot Analysis

In this section, we analyze the relationships between key financial variables using a correlation matrix. The correlation plot helps identify how different variables, such as income, loan amount, credit score, interest rate, and other features, interact with each other. By understanding these correlations, we can gain insights into how specific factors, like income or credit score, may influence loan characteristics and outcomes.

The correlation matrix is visualized to highlight both positive and negative relationships between variables, offering a comprehensive overview of how these financial features are interconnected and how they might affect borrower behavior or loan performance.

#COOR PLOT



setDT(df_final)  # Converts df to a data.table, if it isn't one already

# Select only numeric columns from the data table
numeric_df <- df_final[, .SD, .SDcols = sapply(df_final, is.numeric)]

# Calculate the correlation matrix
cor_matrix <- cor(numeric_df, use = "pairwise.complete.obs")
# Set plotting parameters for larger plot size
# Create a larger plot window
options(repr.plot.width = 35, repr.plot.height = 45)  # Use this in RMarkdown or Jupyter Notebooks

# Plot the correlation matrix using corrplot without the title
corrplot(cor_matrix,
         type = "upper", 
         method = "circle", 
         addCoef.col = 0.2,  # Color for coefficients
         number.cex = 0.1,  # Size of the correlation coefficients
         tl.cex = 0.6,      # Size of the text labels for variables
         tl.col = "black",  # Color of the text labels
         cl.cex = 0.5)      # Size of the color legend text

# Remove numeric_df
rm(numeric_df)
rm(cor_matrix)

Income and Loan Characteristics:
- annual_inc has a moderate positive correlation with loan_amnt (0.1949) and installment (0.188), suggesting that individuals with higher income tend to take larger loans with higher installments.
- dti has a weak positive correlation with int_rate (0.1239), indicating a slight increase in interest rates for borrowers with higher debt-to-income ratios.
Credit Score Indicators:
- fico_range_high and fico_range_low have strong negative correlations with int_rate (-0.4049), indicating that better credit scores are associated with lower interest rates.
- fico_range_high and fico_range_low are negatively correlated with revol_util (-0.4747), suggesting that individuals with higher credit scores tend to have lower revolving credit utilization.
Delinquency Metrics:
- delinq_2yrs has a strong negative correlation with mths_since_last_delinq (-0.5524), indicating that more recent delinquencies are associated with a higher number of delinquencies in the past two years.
- pub_rec shows a weak negative correlation with loan_amnt (-0.0628), suggesting a slight decrease in loan amounts with higher public records of delinquencies.
Open Accounts and Total Accounts:
- total_acc and open_acc are highly correlated (0.7199), which is expected as they both measure account quantities in credit reports.
- total_acc has a weak positive correlation with loan_amnt (0.1966), indicating that borrowers with more accounts tend to have slightly higher loan amounts.

5.4 Analysis of Charged-Off Loans by State

In this section, we examine the geographical distribution of charged-off loans across U.S. states, focusing on changes in loan performance before and after Q2 2015. By analyzing the percentage of charged-off loans by state, we aim to identify regional trends and assess the impact of interventions implemented during this period. Using geographic visualization, we compare the frequency of charged-off loans across different states to uncover potential regional disparities in loan outcomes, and evaluate the success of interventions in improving loan performance in specific areas.

# US MAP

# Load the state coordinates
state_coords <- read.csv("../DataSet/states.csv")

# Step 1: Verify and convert 'issue_d' to Date if necessary
df_final$issue_d <- as.Date(df_final$issue_d, format = "%Y-%m-%d")  # Adjust the format if necessary

# Step 2: Filter the dataframe and remove NaN values
charged_off_df <- df_final %>%
  filter(!is.na(issue_d), !is.na(addr_state))  # Ensure we have dates and states to work with

# Step 3: Create a new column for quarters
charged_off_df <- charged_off_df %>%
  mutate(quarter_year = as.integer(format(issue_d, "%Y")) * 10 + ceiling(as.integer(format(issue_d, "%m")) / 3))

# Step 4: Split the data into two groups: before and including Q2 2015, and after Q2 2015
before_2015_Q2 <- charged_off_df %>%
  filter(quarter_year <= 20152)

after_2015_Q2 <- charged_off_df %>%
  filter(quarter_year > 20152)

# Step 5: Calculate total loans and charged off loans by state for both periods
total_loans_before <- before_2015_Q2 %>%
  group_by(addr_state) %>%
  summarise(total_loans = n(), .groups = "drop")

charged_off_loans_before <- before_2015_Q2 %>%
  filter(loan_status == "Charged Off") %>%
  group_by(addr_state) %>%
  summarise(charged_off_loans = n(), .groups = "drop")

total_loans_after <- after_2015_Q2 %>%
  group_by(addr_state) %>%
  summarise(total_loans = n(), .groups = "drop")

charged_off_loans_after <- after_2015_Q2 %>%
  filter(loan_status == "Charged Off") %>%
  group_by(addr_state) %>%
  summarise(charged_off_loans = n(), .groups = "drop")

# Step 6: Merge totals and charged off loans for both periods
frequency_data_before <- total_loans_before %>%
  left_join(charged_off_loans_before, by = "addr_state") %>%
  mutate(charged_off_loans = ifelse(is.na(charged_off_loans), 0, charged_off_loans),
         percentage_charged_off = (charged_off_loans / total_loans) * 100)

frequency_data_after <- total_loans_after %>%
  left_join(charged_off_loans_after, by = "addr_state") %>%
  mutate(charged_off_loans = ifelse(is.na(charged_off_loans), 0, charged_off_loans),
         percentage_charged_off = (charged_off_loans / total_loans) * 100)

# Step 7: Merge both frequency data with state coordinates
frequency_data <- state_coords %>%
  left_join(frequency_data_before %>% select(addr_state, percentage_charged_off) %>% rename(percentage_charged_off_before = percentage_charged_off), 
            by = c("state" = "addr_state")) %>%
  left_join(frequency_data_after %>% select(addr_state, percentage_charged_off) %>% rename(percentage_charged_off_after = percentage_charged_off), 
            by = c("state" = "addr_state"))



# Step 8: Filter to keep only points within the contiguous US
frequency_data <- frequency_data %>%
  filter(latitude >= 24.396308 & latitude <= 49.384358 & 
         longitude >= -125.0 & longitude <= -66.93457)

# Step 9: Create the map
us_map <- map_data("state")
# Step 10: Plot with state symbols
ggplot() +
  # Base map
  geom_polygon(data = us_map, aes(x = long, y = lat, group = group), fill = "lightgrey") +
  
  # Points for charged off loans before Q2 2015
  geom_point(data = frequency_data, 
             aes(x = longitude, y = latitude, 
                 size = percentage_charged_off_before, 
                 color = "Before Q2 2015"), 
             alpha = 0.5) +
  
  # Points for charged off loans after Q2 2015
  geom_point(data = frequency_data, 
             aes(x = longitude, y = latitude, 
                 size = percentage_charged_off_after, 
                 color = "After Q2 2015"), 
             alpha = 0.5) +
  
  # Add state abbreviations
  geom_text(data = frequency_data, aes(x = longitude, y = latitude, label = state), 
            color = "black", size = 3, vjust = -0.8) +  # Adjust text size and position
  
  # Adjust the size scale
  scale_size_continuous(
    range = c(3, 10),  # Adjusting the size range
    name = "% of Charged Off Loans"
  ) +
  
  # Define color scale
  scale_color_manual(values = c("Before Q2 2015" = "red", "After Q2 2015" = "green"),
                     name = "Time Period") +
  
  # Labels and title
  labs(title = "% of Charged Off Loans by State (Before and After Q2 2015)",
       x = "Longitude",
       y = "Latitude") +
  
  theme_minimal() +
  theme(legend.position = "top")  # Place legend at the top

rm(after_2015_Q2)
rm(before_2015_Q2)
rm(charged_off_df)
rm(charged_off_loans_after)
rm(charged_off_loans_before)
rm(us_map)
rm(frequency_data)
rm(frequency_data_after)
rm(state_coords)
rm(total_loans_after)
rm(total_loans_before)
rm(frequency_data_before)

Iowa (IA):
- Before: 50.0%
- After: NA
- Insight: The charged-off percentage was significantly high before the intervention, but data for after the intervention is not available, indicating a potential area for further investigation.
Nebraska (NE):
- Before: 50.0%
- After: 11.51%
- Insight: A substantial decrease from 50.0% to 11.51%, indicating a successful intervention that significantly improved loan performance.
Florida (FL):
- Before: 18.6%
- After: 10.52%
- Insight: A notable decrease of 8.08 percentage points, suggesting a positive impact of the intervention.
Mississippi (MS):
- Before: 21.4%
- After: 11.87%
- Insight: The charged-off percentage decreased by 9.53 percentage points, indicating improved loan performance.
California (CA):
- Before: 16.9%
- After: 10.32%
- Insight: A reduction of 6.58 percentage points, demonstrating effective measures in managing loan defaults.

The analysis highlights states like Nebraska and Iowa as significant cases with varying outcomes in charged-off loans. The successful interventions in Nebraska indicate that targeted strategies can yield substantial improvements in loan performance.

5.5 Comparison of Debt-to-Income Ratios: Charged-Off vs. Fully Paid Loans (2013-2018)

This section focuses on the debt-to-income (DTI) ratio and its relationship with loan outcomes, specifically comparing charged-off and fully paid loans between 2013 and 2018. The DTI ratio, a key metric used to assess borrowers’ ability to manage loan repayments, is analyzed to identify its role as a potential risk factor for loan defaults. By visualizing DTI distributions across different quarters using box plots, we highlight differences between charged-off and fully paid loans and explore trends over time, providing insights into how DTI may influence loan performance.

# BOX PLOT

df1 <- df_final
# Ensure issue_d is in Date format
df1$issue_d <- as.Date(df1$issue_d, format = "%b-%Y")

# Filter out rows where issue_d is NA
df1 <- df1 %>%
  filter(!is.na(issue_d))

# Extract year and quarter from issue_d
df1$quarter <- paste0("Q", quarter(df1$issue_d), " ", year(df1$issue_d))

# Filter the data for only "Charged Off" and "Fully Paid" loan statuses
df_filtered <- df1 %>%
  filter(loan_status %in% c("Charged Off", "Fully Paid"))

# Define a function to remove outliers based on IQR
remove_outliers <- function(data) {
  Q1 <- quantile(data$dti, 0.25, na.rm = TRUE)
  Q3 <- quantile(data$dti, 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  data %>%
    filter(dti >= lower_bound & dti <= upper_bound)
}

# Remove outliers by applying the function to each quarter and loan_status group
df_no_outliers <- df_filtered %>%
  group_by(quarter, loan_status) %>%
  do(remove_outliers(.))

# Ensure proper order of quarters
df_no_outliers$quarter <- factor(df_no_outliers$quarter, 
                                 levels = unique(df_no_outliers$quarter[order(df_no_outliers$issue_d)]))

# Plot box plot for dti for each quarter with loan_status distinction (without outliers)
ggplot(df_no_outliers, aes(x = quarter, y = dti, fill = loan_status)) +
  geom_boxplot() +
  labs(title = "DTI Box Plot by Quarter for Charged Off and Fully Paid Loans",
       x = "Quarter",
       y = "DTI",
       fill = "Loan Status") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for readability

rm(df_filtered)
rm(df1)
rm(df_no_outliers)
rm(expected_counts)
rm(chi_sq_result)

Median DTI Comparison: Each quarter, the median DTI for charged-off applicants is higher compared to fully paid applicants, with an average difference of 4.3%.
Trends Over Time: The medians of DTI show an upward trend toward 2015, followed by a slight decrease until 2018.

The analysis indicates that higher DTI ratios are associated with charged-off loans, suggesting a potential risk factor for loan defaults. The observed trends in DTI medians may warrant further investigation to understand the underlying causes.

5.6 Radar Chart Analysis of Loan Characteristics (2014 vs. 2018)

In this section, we use radar charts to compare key loan characteristics between loans issued in 2014 and 2018. By standardizing variables such as interest rates, annual income, FICO score range, and debt-to-income (DTI) ratio, we visualize changes in borrower profiles and loan terms over time. This analysis highlights shifts in loan affordability, borrower financial health, and creditworthiness, offering insights into how lending practices evolved during this period.

# RADAR CHART

# Step 1: Filter DataFrames for loans issued in 2014 and 2017
df_2014 <- df_final[df_final$issue_d >= as.Date("2015-01-01") & df_final$issue_d <= as.Date("2015-12-31"), ]
df_2017 <- df_final[df_final$issue_d >= as.Date("2018-01-01") & df_final$issue_d <= as.Date("2018-12-31"), ]

# Step 2: Function to remove outliers using IQR
# Function to remove outliers using IQR method
remove_outliers <- function(df, columns) {
  for (col in columns) {
    if (col %in% names(df)) {
      Q1 <- quantile(df[[col]], 0.25, na.rm = TRUE)
      Q3 <- quantile(df[[col]], 0.75, na.rm = TRUE)
      IQR_value <- IQR(df[[col]], na.rm = TRUE)
      
      # Define lower and upper bounds
      lower_bound <- Q1 - 1.5 * IQR_value
      upper_bound <- Q3 + 1.5 * IQR_value
      
      # Filter out the outliers
      df <- df[df[[col]] >= lower_bound & df[[col]] <= upper_bound, ]
    } else {
      message(paste("Column", col, "not found in the dataframe. Skipping."))
    }
  }
  return(df)
}

# Step 3: Select the columns of interest for standardization and outlier removal
selected_columns <- c("int_rate", "annual_inc", "fico_range_low", "fico_range_high", "dti")
# Remove outliers in each group
df_2014 <- remove_outliers(df_2014, selected_columns)
df_2017 <- remove_outliers(df_2017, selected_columns)

# Step 4: Function to standardize numeric columns (scaling from 0 to 1)
# Convert df_2014 to a data.table
setDT(df_2014)
setDT(df_2017)

# Define the standardization function using data.table syntax
standardize_columns <- function(dt, selected_cols) {
  dt[, (selected_cols) := lapply(.SD, function(x) (x - min(x, na.rm = TRUE)) / 
                                               (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))), 
       .SDcols = selected_cols]
  return(dt)
}


# Standardize the selected columns in each group
df_2014 <- standardize_columns(df_2014, selected_columns)
df_2017 <- standardize_columns(df_2017, selected_columns)

# Step 5: Calculate the medians for selected numeric columns
medians_2014 <- df_2014[, lapply(.SD, median, na.rm = TRUE), .SDcols = selected_columns]
medians_2017 <- df_2017[, lapply(.SD, median, na.rm = TRUE), .SDcols = selected_columns]

# Step 6: Combine the medians into a data frame for radar chart
radar_data_2014_2017 <- rbind(as.data.frame(medians_2014), 
                               as.data.frame(medians_2017))

# Step 7: Assign appropriate row names for each category
rownames(radar_data_2014_2017) <- c("2014", "2017")

# Step 8: Add max and min rows for radar chart
radar_data_2014_2017 <- rbind(max = rep(1, ncol(radar_data_2014_2017)), 
                               min = rep(0, ncol(radar_data_2014_2017)), 
                               radar_data_2014_2017)

# Set the background color to white
par(bg = "white")  
# Step 9: Plot the radar chart for 2014 and 2017
radarchart(radar_data_2014_2017, 
           axistype = 1,  # Set axis type
           pcol = c("#1E90FF", "#32CD32"),  # Brighter line colors for 2014 and 2017
           plty = 1,  # Line type
           title = "Comparison of Loan Variables (2014 vs. 2017)",
           cglcol = "lightgrey",  # Lighter grid line color
           cglty = 1,  # Type of the grid lines
           caxislabels = seq(0, 1, 0.1),  # Customize axis labels (0 to 1 range)
           axislabcol = "black",  # Axis label color
           vlcex = 0.8,  # Text size of labels
           titlecol = "black",  # Title color
           cglwd = 0.8)  # Thickness of the grid lines

# Step 10: Add legend to the radar chart
legend("topright",  # Position of the legend
       legend = c("2014", "2017"),  # Labels for the legend
       col = c("#1E90FF", "#32CD32"),  # Corresponding bright colors
       lty = 1,  # Line type in the legend
       bty = "n",  # No box around the legend
       text.col = "black",  # Text color for the legend
       cex = 0.8)  # Text size of the legend

rm(df_2014)
rm(df_2017)
rm(medians_2014)
rm(medians_2017)
rm(radar_data_2014_2017)

Interest Rate (int_rate): The interest rate decreased from 36.8% in 2014 to 30.4% in 2017, indicating an improvement in loan affordability over time.
Annual Income (annual_inc): The annual income remained relatively stable, with a slight decline from 38.1% in 2014 to 38.0% in 2017, suggesting that income levels did not significantly impact loan eligibility during this period.
FICO Score Range: The lower bound of the FICO score range increased from 26.3% in 2014 to 30.8% in 2017, indicating a potential shift towards lending to higher credit quality borrowers.
Debt-to-Income Ratio (dti): The debt-to-income ratio improved slightly, decreasing from 42.8% in 2014 to 39.0% in 2017, suggesting better financial health among borrowers over time.

5.7 Contour Plot Comparing Risk Score vs. Debt-to-Income Ratio Before and After 2015

This section examines the relationship between risk scores and debt-to-income (DTI) ratios for rejected loan applications, utilizing contour plots to visualize the distributions before and after 2015. By analyzing these trends, we aim to identify shifts in lending criteria and applicant profiles over time. The findings will provide insights into how lenders may have adjusted their risk assessment practices, reflecting broader economic changes and evolving borrower risk profiles.

# Contour PLOT

# Clone the original dataframe
dfr1 <- dfr

# Convert Application Date to Date format if not already done
dfr1$Application_Date <- as.Date(dfr1$'Application Date')

# Convert Risk_Score to numeric, handling NAs
dfr1$Risk_Score <- as.numeric(dfr1$Risk_Score)

# Convert Debt-To-Income Ratio to numeric by removing "%" and converting to decimal
dfr1$Debt_To_Income_Ratio <- as.numeric(gsub("%", "", dfr1$`Debt-To-Income Ratio`)) / 100

# Filter for application dates before and after 2015
dfr_before_2015 <- dfr1 %>% filter(Application_Date < as.Date("2015-01-01"))
dfr_after_2015 <- dfr1 %>% filter(Application_Date >= as.Date("2015-01-01"))

# Define outlier removal function using IQR
remove_outliers <- function(data) {
  # Calculate Q1, Q3, and IQR for Risk_Score
  Q1_risk <- quantile(data$Risk_Score, 0.25, na.rm = TRUE)
  Q3_risk <- quantile(data$Risk_Score, 0.75, na.rm = TRUE)
  IQR_risk <- IQR(data$Risk_Score, na.rm = TRUE)
  
  # Calculate Q1, Q3, and IQR for Debt_To_Income_Ratio
  Q1_dti <- quantile(data$Debt_To_Income_Ratio, 0.25, na.rm = TRUE)
  Q3_dti <- quantile(data$Debt_To_Income_Ratio, 0.75, na.rm = TRUE)
  IQR_dti <- IQR(data$Debt_To_Income_Ratio, na.rm = TRUE)

  # Filter data to remove outliers
  data_filtered <- data %>%
    filter(
      Risk_Score >= (Q1_risk - 1.5 * IQR_risk) & Risk_Score <= (Q3_risk + 1.5 * IQR_risk) &
      Debt_To_Income_Ratio >= (Q1_dti - 1.5 * IQR_dti) & Debt_To_Income_Ratio <= (Q3_dti + 1.5 * IQR_dti)
    )
  
  return(data_filtered)
}

# Remove outliers from both datasets
dfr_before_2015_clean <- remove_outliers(dfr_before_2015)
dfr_after_2015_clean <- remove_outliers(dfr_after_2015)

# Combine the datasets for plotting
dfr_combined <- bind_rows(
  dfr_before_2015_clean %>% mutate(Period = "Before 2015"),
  dfr_after_2015_clean %>% mutate(Period = "After 2015")
)

# Create a combined contour plot
ggplot(dfr_combined, aes(x = Risk_Score, y = Debt_To_Income_Ratio, color = Period)) +
  geom_density_2d() +
  labs(title = "Contour Plot of Risk Score vs Debt-To-Income Ratio",
       x = "Risk Score",
       y = "Debt-To-Income Ratio") +
  scale_color_manual(values = c("Before 2015" = "blue", "After 2015" = "red")) +
  theme_minimal() +
  theme(legend.title = element_blank())

# Clean up
rm(dfr_before_2015)
rm(dfr_after_2015)
rm(dfr1)
rm(dfr_before_2015_clean)
rm(dfr_after_2015_clean)

General Distribution:
The contour plot shows distinct areas of concentration of rejected loan applications with varying risk scores and debt-to-income (DTI) ratios. The highest concentration for both periods is around the DTI ratio of 0.2 to 0.3 and risk scores between 600 and 700, indicating that a significant portion of rejections occurred in this range.
Before 2015 vs After 2015:
- The blue contours (Before 2015) indicate a denser concentration of rejected applicants with risk scores between 600 and 700 and a slightly higher DTI ratio compared to red contours (After 2015).
- After 2015, the contours suggest a shift in the distribution of rejected loans, with more applicants being rejected both at lower risk scores (500-600) and higher risk scores (700+). This indicates that rejection criteria may have broadened to encompass a wider range of applicants.
Debt-to-Income Ratio:
There is a noticeable shift in the red contour (After 2015) towards a higher DTI ratio (around 0.4 to 0.6) at various risk levels, compared to blue contours (Before 2015). This suggests that after 2015, rejected applicants tended to have higher debt burdens relative to their income, potentially reflecting either more lenient application behavior or tighter lending standards.
Risk Concentration Shifts:
- The After 2015 data shows a wider spread in the DTI ratio for rejected applicants with risk scores between 500 and 600, indicating that applicants with lower risk scores are increasingly being rejected for higher debt-to-income ratios.
- In contrast, Before 2015, individuals with lower risk scores (below 600) who were rejected had tighter DTI ratios, suggesting that lending practices were more stringent in this category before 2015.
Implication on Risk:
The increased spread in both the risk scores and DTI ratios for rejected loans after 2015 may indicate changes in risk assessment and rejection criteria. Lenders might have shifted their focus, possibly due to economic changes post-2015, leading to the rejection of both higher-risk and lower-risk applicants with varying debt profiles.

5.8 Histogram Analysis of Debt-to-Income Ratios (DTI) for Rejected Loans

In this section, we conduct a histogram analysis of debt-to-income (DTI) ratios for rejected loans, focusing on data from before and after 2015. By visualizing the frequency distribution of DTI ratios, we aim to uncover trends in lending practices and the criteria used for loan rejection over time.

The histogram will reveal how the distribution of DTI values has shifted, highlighting any significant changes in rejection patterns associated with varying DTI levels. This analysis will provide insights into the evolving risk assessment strategies employed by lenders, particularly in response to economic conditions and regulatory changes post-2015.

# HISTOGRAM 

# Load necessary libraries
library(ggplot2)
library(dplyr)

dfr1 <- dfr

# Ensure the Debt-To-Income Ratio is numeric
dfr1$dti <- as.numeric(gsub("%", "", dfr1$`Debt-To-Income Ratio`))  # Remove '%' and convert to numeric

# Ensure the Application Date is in Date format
dfr1$`Application Date` <- as.Date(dfr1$`Application Date`, format = "%Y-%m-%d")

# Remove NA values (if any) in Debt-To-Income Ratio and Application Date
dfr1 <- dfr1 %>% filter(!is.na(dti) & !is.na(`Application Date`))

# Define a function to remove outliers based on IQR
remove_outliers <- function(data) {
  Q1 <- quantile(data$dti, 0.25, na.rm = TRUE)  # 1st quartile (25th percentile)
  Q3 <- quantile(data$dti, 0.75, na.rm = TRUE)  # 3rd quartile (75th percentile)
  IQR <- Q3 - Q1  # Interquartile range
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  data %>%
    filter(dti >= lower_bound & dti <= upper_bound)  # Remove outliers
}

# Apply the function to remove outliers
dfr_no_outliers <- remove_outliers(dfr1)

# Split the data into before and after 2015
dfr_before_2015 <- dfr_no_outliers %>% filter(`Application Date` < as.Date("2015-01-01"))
dfr_after_2015 <- dfr_no_outliers %>% filter(`Application Date` >= as.Date("2015-01-01"))

# Create the frequency distribution plot using ggplot
ggplot() +
  geom_histogram(data = dfr_before_2015, aes(x = dti, y = ..count.., color = "Before 2015", fill = "Before 2015"), 
                 binwidth = 1, alpha = 0.4, position = "identity") +
  geom_histogram(data = dfr_after_2015, aes(x = dti, y = ..count.., color = "After 2015", fill = "After 2015"), 
                 binwidth = 1, alpha = 0.4, position = "identity") +
  labs(title = "Frequency Distribution of Debt-To-Income Ratio (Approved Loans)",
       x = "Debt-To-Income Ratio (%)",
       y = "Frequency",
       color = "Period",
       fill = "Period") +
  theme_minimal() +
  scale_color_manual(values = c("Before 2015" = "blue", "After 2015" = "red")) +
  scale_fill_manual(values = c("Before 2015" = "blue", "After 2015" = "red")) +
  theme(legend.position = "top")

# Optional: Clean up memory
rm(dfr1)
rm(dfr_no_outliers)
rm(dfr_before_2015)
rm(dfr_after_2015)
rm(df_no_outliers)

DTI Distribution Before 2015: The histogram for DTI values prior to 2015 shows a relatively even distribution across various DTI ranges, with no significant concentration in any specific range.
DTI Distribution After 2015:
- A notable increase in rejections is observed for DTI values less than 5, indicating stricter lending criteria during this period.
- Following the initial spike at low DTI values, there is a gradual decline in rejection counts as DTI increases towards 9.
- Beyond a DTI of 9, the distribution appears to be right-skewed, suggesting that while there are fewer rejections in this range, those with very high DTIs are still facing rejections.
Overall Trend: The analysis indicates a shift in lending standards after 2015, where lower DTIs are increasingly leading to loan rejections, potentially reflecting a more cautious approach by lenders to manage risk.

rm(dfr_combined)
rm(dfr1)

6 Research Question 3

Do the loan grades provided to each customer correlate to their loan repayment behavior based on income and loan status?

df_final <- df %>%

  mutate(loan_status_num = ifelse(loan_status == "Fully Paid", 1, 
                                  ifelse(loan_status == "Charged Off", 0, NA)))

6.1 Loan Repayment By Grade

6.1.1 BAR PLOT

In this section, we analyze the loan repayment rates categorized by loan grades using a bar plot. By transforming loan status into a binary variable—where “Fully Paid” is coded as 1 and “Charged Off” as 0—we can effectively calculate the repayment rate for each grade.

The bar plot will visually represent the correlation between loan grades and repayment rates, revealing trends that indicate how loan quality influences repayment behavior. This analysis is crucial for understanding the risk associated with different loan grades, highlighting significant disparities in repayment performance. The findings will underscore the importance of grading in assessing loan performance and guiding lending strategies.

# Assuming "Fully Paid" = 1 and "Charged Off" = 0
df_final <- df_final %>%
  mutate(loan_status_binary = ifelse(loan_status == "Fully Paid", 1, 
                                     ifelse(loan_status == "Charged Off", 0, NA)))

# Calculate repayment rate by grade
repayment_rate <- df_final %>%
  group_by(grade) %>%
  summarise(repayment_rate = mean(loan_status_binary, na.rm = TRUE)) %>%
  mutate(grade = factor(grade, levels = c("A", "B", "C", "D", "E", "F", "G")))  # Ensure correct order

# Create the enhanced bar plot
ggplot(repayment_rate, aes(x = grade, y = repayment_rate, fill = grade)) +
  geom_bar(stat = "identity", width = 0.7) +
  geom_text(aes(label = scales::percent(repayment_rate, accuracy = 0.1)), 
            vjust = -0.5, size = 4, fontface = "bold") +
  scale_y_continuous(labels = scales::percent_format(), limits = c(0, 1)) +
  scale_fill_viridis_d() +
  labs(title = "Loan Repayment Rate by Grade",
       subtitle = "Higher grades show better repayment rates",
       x = "Loan Grade",
       y = "Repayment Rate") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 12, face = "italic"),
    axis.title = element_text(face = "bold"),
    axis.text = element_text(size = 10), 
    legend.position = "none"
  )

rm(repayment_rate)

Clear correlation between grade and repayment: There’s a strong positive relationship between loan grade and repayment rate. As the grade improves from G to A, the repayment rate steadily increases.
Significant gap between top and bottom grades: The highest grade (A) has a repayment rate of 94%, while the lowest grade (G) is at 49.2%—a difference of nearly 45 percentage points. This highlights the substantial risk difference between the best and worst-rated loans.
Steeper drops in lower grades: The decrease in repayment rate appears more pronounced between lower grades (e.g., from E to F to G) compared to higher grades. This suggests that risk increases more rapidly as loan quality declines below a certain threshold.

6.2 Income Distribution By Loan Grade

6.2.1 BOX PLOT

In this section, we analyze the distribution of annual income across different loan grades using a box plot. The box plot provides a visual representation of the income variability within each grade while also highlighting the median income. To ensure accuracy, we employ an outlier removal technique based on the interquartile range (IQR), which helps to focus the analysis on the central tendency of the income distribution. Our findings reveal a clear correlation between loan grade and income, with higher grades consistently associated with greater median incomes. Additionally, we observe that the income disparity narrows among lower grades, indicating that other factors may play a significant role in determining loan grades. This analysis not only emphasizes the importance of income in the lending process but also suggests that additional criteria may influence loan classification, particularly in the lower-grade categories.

# Function to remove outliers using IQR method
remove_outliers <- function(x) {
  qnt <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  H <- 1.5 * IQR(x, na.rm = TRUE)
  x[x < (qnt[1] - H) | x > (qnt[2] + H)] <- NA
  return(x)
}

# Prepare the data and remove outliers
plot_data <- df_final %>%
  filter(!is.na(annual_inc) & annual_inc > 0) %>%
  mutate(grade = factor(grade, levels = c("A", "B", "C", "D", "E", "F", "G"))) %>%
  group_by(grade) %>%
  mutate(annual_inc_clean = remove_outliers(annual_inc)) %>%
  filter(!is.na(annual_inc_clean)) %>%
  ungroup()

# Calculate median income for annotation
median_incomes <- plot_data %>%
  group_by(grade) %>%
  summarise(median_income = median(annual_inc_clean))

# Create the enhanced boxplot without outliers
income_plot <- ggplot(plot_data, aes(x = grade, y = annual_inc_clean, fill = grade)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_text(data = median_incomes, aes(y = median_income, label = scales::dollar(median_income)),
            vjust = -0.5, size = 3, fontface = "bold") +
  scale_y_log10(labels = scales::dollar_format(), 
                breaks = scales::trans_breaks("log10", function(x) 10^x),
                minor_breaks = NULL,
                limits = c(1000, NA)) +  # Set lower limit to reduce space
  scale_fill_viridis_d(option = "D", begin = 0.3, end = 0.9) +
  labs(title = "Income Distribution by Loan Grade",
       x = "Loan Grade",
       y = "Annual Income") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(hjust = 0.5, size = 22, face = "bold", margin = margin(b = 10)),
    axis.title = element_text(face = "bold", size = 16),
    axis.text = element_text(size = 12),
    legend.position = "none",
    panel.grid.major = element_line(color = "gray90"),
    panel.grid.minor = element_blank(),
    plot.background = element_rect(fill = "white", color = NA),
    panel.background = element_rect(fill = "gray98", color = NA),
    plot.margin = margin(t = 20, r = 20, b = 20, l = 20)
  ) +
  coord_cartesian(clip = "off") +
  annotate("text", x = Inf, y = Inf, label = "BOXPLOT",
           hjust = 1.1, vjust = 2, size = 4.5, fontface = "italic", color = "gray30")

# Display the plot (optional)
print(income_plot)

# Save the plot
ggsave("income_distribution_boxplot_no_outliers.png", plot = income_plot, width = 12, height = 8, dpi = 300)
rm(income_plot)
rm(median_incomes)
rm(plot_data)

Higher income for better grades: There’s a clear trend of higher median incomes for better loan grades. Grade A has the highest median income at $74,000, while lower grades (C through G) have lower median incomes around $60,000-$62,000.
Income disparity narrows in lower grades: The income distributions for grades C through G are very similar, with median incomes clustered closely together. This suggests that factors other than income may be more influential in determining these lower grades.
Wider income range for top grades: The boxplots for grades A and B appear to have larger interquartile ranges, indicating more income variability among borrowers in these categories. This could suggest that high-grade loans are accessible to a broader range of income levels, possibly due to other strong credit factors.

# Create income brackets
# Create income brackets
df_final <- df_final %>%
  mutate(income_bracket = cut(annual_inc, 
                              breaks = quantile(annual_inc, probs = seq(0, 1, 0.2), na.rm = TRUE),
                              labels = c('Very Low', 'Low', 'Medium', 'High', 'Very High'),
                              include.lowest = TRUE))

# Reorder income_bracket factor levels
df_final$income_bracket <- factor(df_final$income_bracket, 
                              levels = c('Very High', 'High', 'Medium', 'Low', 'Very Low'))

# Calculate repayment rates
repayment_rates <- df_final %>%
  group_by(grade, income_bracket) %>%
  summarize(repayment_rate = mean(loan_status_binary, na.rm = TRUE)) %>%
  pivot_wider(names_from = income_bracket, values_from = repayment_rate)

# Display the repayment rates
print(repayment_rates)

# A tibble: 7 × 6
# Groups:   grade [7]
  grade `Very High`  High Medium   Low `Very Low`
  <chr>       <dbl> <dbl>  <dbl> <dbl>      <dbl>
1 A           0.950 0.945  0.939 0.936      0.914
2 B           0.881 0.873  0.866 0.861      0.846
3 C           0.800 0.784  0.769 0.765      0.755
4 D           0.732 0.701  0.686 0.676      0.676
5 E           0.662 0.622  0.590 0.584      0.599
6 F           0.602 0.544  0.526 0.516      0.522
7 G           0.571 0.499  0.482 0.460      0.461

rm(heatmap_data)

6.3 Repayment Rate by Grade and Income Bracket

In this section, we focus on preparing the data for a heatmap analysis that explores the relationship between income brackets and loan repayment rates across different loan grades. To facilitate this analysis, we first categorize annual income into distinct brackets using quantiles, which helps in understanding how income levels relate to repayment behavior. The income brackets include “Very Low,” “Low,” “Medium,” “High,” and “Very High,” and are structured to reflect varying financial capacities of borrowers.

Next, we calculate the repayment rates for each combination of loan grade and income bracket, providing insights into how repayment likelihood varies based on these factors. This transformation of the data into a wide format allows for an effective visualization of repayment rates in a heatmap, highlighting trends and patterns across different income segments and loan grades. The findings from this analysis can inform lending strategies and risk assessment, showcasing the intricate interplay between borrower income and loan performance.

6.3.1 HEATMAP

# Prepare data for heatmap
heatmap_data <- repayment_rates %>%
  pivot_longer(cols = -grade, names_to = "income_bracket", values_to = "repayment_rate")

# Ensure income_bracket is a factor with correct order
heatmap_data$income_bracket <- factor(heatmap_data$income_bracket, 
                                      levels = c('Very High', 'High', 'Medium', 'Low', 'Very Low'))

# Create heatmap
ggplot(heatmap_data, aes(x = income_bracket, y = grade, fill = repayment_rate)) +
  geom_tile() +
  geom_text(aes(label = scales::percent(repayment_rate, accuracy = 0.1)), color = "black") +
  scale_fill_gradientn(colors = c("yellow", "green", "blue"), 
                       labels = scales::percent_format(accuracy = 1),
                       name = "Repayment Rate") +
  labs(title = "Repayment Rate by Grade and Income Bracket",
       x = "Income Bracket",
       y = "Loan Grade") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

rm(heatmap_data)
rm(repayment_rates)

Grade is the primary determinant of repayment: There’s a clear vertical gradient showing that loan grade has a stronger influence on repayment rates than income. Grade A loans consistently have the highest repayment rates (91-95%) across all income brackets, while Grade G loans have the lowest (46-57%).
Income impact varies by grade: For higher-grade loans (A-C), income has minimal effect on repayment rates. However, for lower-grade loans (D-G), there’s a more noticeable positive correlation between income and repayment rates, particularly for the “Very High” income bracket.
Surprising trend in lowest grades: Counterintuitively, for the lowest grades (F-G), the “Very High” income bracket shows higher repayment rates than some middle income brackets. This suggests that high-income borrowers with poor credit scores might be better risks than their grades suggest.

6.4 ANOVA Test

This section presents the results of an ANOVA test aimed at assessing whether there are significant differences in annual income across various loan grades. The test revealed an F-statistic of 1260.67 and a p-value of 0.0000, indicating highly statistically significant differences in income levels across the grades. These findings confirm the strong relationship between income and loan grades, supporting earlier visual observations that higher grades are associated with higher incomes. Overall, the results suggest that income is a key factor in determining loan grades.

# Perform ANOVA to test for income differences across grades
anova_result <- aov(annual_inc ~ grade, data = df_final)

# Display the ANOVA summary
anova_summary <- summary(anova_result)

# Print the ANOVA results
cat("ANOVA test results for income differences across grades:\n")

ANOVA test results for income differences across grades:

cat(sprintf("F-statistic = %.2f, p-value = %.4f\n", 
            anova_summary[[1]]$`F value`[1], 
            anova_summary[[1]]$`Pr(>F)`[1]))

F-statistic = 1260.67, p-value = 0.0000

rm(anova_result)
rm(anova_summary)

F-statistic: 1260.67, p-value: 0.0000
Statistically significant difference: The extremely low p-value (0.0000) indicates there are highly statistically significant differences in income levels across loan grades. This suggests that income is a meaningful factor in determining loan grades.
Strong relationship: The large F-statistic (1260.67) indicates a strong relationship between loan grades and income levels. This suggests that income variations explain a substantial portion of the differences between loan grades.
Confirmation of visual trends: These results statistically confirm the visual trends observed in previous charts, where higher loan grades were associated with higher incomes. The ANOVA provides robust evidence that these income differences across grades are not due to chance.

6.5 Chi Square Test Grade Vs Loan Status

In this section, we conduct a Chi-Square test to evaluate the relationship between loan grades and loan status outcomes. The results indicate a highly significant association between these two variables, with a p-value of 0.0000. Specifically, the analysis reveals that Grade A loans have a higher-than-expected number of “Fully Paid” outcomes, while lower grades (E, F, and G) exhibit a concerning trend with increased “Charged Off” and “Late” payments. These findings suggest that loan grade is a strong predictor of loan performance, reinforcing the notion that higher grades correspond to lower risk in loan repayments.

# Create a contingency table
contingency_table <- table(df_final$grade, df_final$loan_status)

# Perform chi-square test
chi_square_result <- chisq.test(contingency_table)

# Extract results
chi2 <- chi_square_result$statistic
p_value <- chi_square_result$p.value
dof <- chi_square_result$parameter

# Print results
cat(sprintf("Chi-square test results: chi2 = %.2f, p-value = %.4f\n", 
            chi2, p_value))

Chi-square test results: chi2 = 123374.94, p-value = 0.0000

cat(sprintf("Degrees of freedom: %d\n", dof))

Degrees of freedom: 36

# Print contingency table
cat("\nContingency table:\n")


Contingency table:

print(contingency_table)

   
    Charged Off Current Default Fully Paid In Grace Period Late (16-30 days)
  A       12767  195539       5     200432             632               341
  B       48471  262903       7     311162            1771               993
  C       81750  255707       8     277973            2993              1476
  D       57983  115138      13     129402            1899               969
  E       34228   38080       6      53028             799               402
  F       13654    8485       1      15896             240               129
  G        4342    2465       0       4197             102                39
   
    Late (31-120 days)
  A               1322
  B               4788
  C               7554
  D               4761
  E               2166
  F                640
  G                236

# Print expected frequencies
cat("\nExpected frequencies:\n")


Expected frequencies:

print(chi_square_result$expected)

   
    Charged Off Current Default Fully Paid In Grace Period Late (16-30 days)
  A       48229  167303   7.619     188974          1606.9             828.4
  B       73932  256464  11.680     289686          2463.3            1269.9
  C       73623  255392  11.631     288475          2453.0            1264.6
  D       36393  126245   5.749     142598          1212.5             625.1
  E       15102   52388   2.386      59174           503.2             259.4
  F        4581   15892   0.724      17951           152.6              78.7
  G        1335    4632   0.211       5232            44.5              22.9
   
    Late (31-120 days)
  A               4089
  B               6268
  C               6242
  D               3086
  E               1280
  F                388
  G                113

rm(chi_square_result)

Highly significant relationship: The extremely low p-value (0.0000) indicates a statistically significant relationship between loan grades and loan outcomes. This suggests that loan grade is a strong predictor of loan performance.
Grade A outperforms expectations: Comparing observed to expected frequencies, Grade A loans have far more “Fully Paid” outcomes (200,432 vs 188,974 expected) and fewer “Charged Off” outcomes (12,767 vs 48,229 expected) than expected. This underscores the lower risk associated with top-grade loans.
Lower grades struggle more: Grades E, F, and G show higher frequencies of “Charged Off” and “Late” payments compared to their expected values. This aligns with the higher risk profile of lower-grade loans and supports the trends seen in previous analyses.

6.5.1 Interpretation Of chi Square Test

# Interpretation
cat("\nInterpretation:\n")


Interpretation:

if (p_value < 0.05) {
  cat("The chi-square test shows a statistically significant association between Grade and Loan Status (p < 0.05).")
  cat("\nThis suggests that the loan grade is not independent of the loan status.")
} else {
  cat("The chi-square test does not show a statistically significant association between Grade and Loan Status (p >= 0.05).")
  cat("\nThis suggests that the loan grade might be independent of the loan status.")
}

The chi-square test shows a statistically significant association between Grade and Loan Status (p < 0.05).
This suggests that the loan grade is not independent of the loan status.

# Effect size (Cramer's V)
n <- sum(contingency_table)
min_dim <- min(dim(contingency_table)) - 1
cramer_v <- sqrt(chi2 / (n * min_dim))

cat(sprintf("\n\nEffect size (Cramer's V): %.4f\n", cramer_v))



Effect size (Cramer's V): 0.0976

cat("Interpretation of Cramer's V:\n")

Interpretation of Cramer's V:

if (cramer_v < 0.1) {
  cat("Negligible association")
} else if (cramer_v < 0.3) {
  cat("Weak association")
} else if (cramer_v < 0.5) {
  cat("Moderate association")
} else {
  cat("Strong association")
}

Negligible association

6.6 Loan Status Distribution By Grade

6.6.1 STACKED BAR CHART

This section presents a stacked bar chart illustrating the distribution of loan statuses by grade. The visualization highlights a clear inverse relationship between loan grade and charge-off rates, with Grade A loans exhibiting the lowest charge-off proportions and Grade G loans the highest. Additionally, higher grades (A and B) show a significantly greater proportion of fully paid loans, emphasizing their lower risk profile. Conversely, as loan grades decrease, the proportion of loans in “Current” status increases, suggesting potential borrower behavior trends and the impact of loan terms on repayment patterns.

#3. Loan Status Distribution by Grade
df_final %>%
  ggplot(aes(x = grade, fill = loan_status)) +
  geom_bar(position = "fill") +
  labs(title = "Loan Status Distribution by Grade",
       x = "Loan Grade", y = "Proportion",
       fill = "Loan Status") +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent_format())

rm(df_final)

Inverse relationship between grade and charge-offs: There’s a clear trend of increasing charge-off rates as the loan grade decreases. Grade A loans have the lowest proportion of charge-offs, while Grade G has the highest. This reinforces the higher risk associated with lower-grade loans.
Fully paid loans dominate higher grades: Grades A and B show a significantly higher proportion of fully paid loans compared to lower grades. This indicates that higher-grade loans are more likely to be repaid in full, aligning with their lower risk profile.
Current loans increase in lower grades: The proportion of loans in “Current” status increases noticeably from Grade C to G. This could suggest that lower-grade loans have longer terms or that borrowers of lower-grade loans are more likely to make minimum payments rather than paying off the loan early.

7 Research Question 4

Is there a relationship between employment length and the amount requested for rejected loans and how does it compare to the customers that had accepted loan.

This analysis explores the relationship between employment length and the amount requested for both rejected and accepted loans. By comparing these two groups, we aim to understand how employment stability may influence loan-seeking behavior and the associated loan amounts. This comparison will help identify potential trends and differences that could inform lending practices and risk assessments for various employment profiles.

7.1 Data Cleaning

In this section, we focus on cleaning the data for both accepted and rejected loans to ensure accurate analysis. Specifically, we will standardize the “Employment Length” and “Amount Requested” fields. For the employment length, we convert various formats into a consistent numeric representation, replacing values like “< 1 year” with 0.5 and “10+ years” with 10. We will also remove any unnecessary text for clarity. Additionally, we will clean the loan amount fields to confirm they are numeric, stripping away any non-numeric characters. This cleaning process is essential to facilitate meaningful comparisons between the two loan groups and to analyze their relationship with the requested amounts effectively.

accepted_loans <- df
rejected_loans <- dfr
rm(df)
rm(dfr)
unique(rejected_loans$"Employment Length")

 [1] "< 1 year"  "5 years"   "10+ years" ""          "1 year"    "2 years"  
 [7] "3 years"   "6 years"   "4 years"   "7 years"   "8 years"   "9 years"

unique(accepted_loans$emp_length)

 [1] "10+ years" "3 years"   "4 years"   "6 years"   "1 year"    "7 years"  
 [7] "8 years"   "5 years"   "2 years"   "9 years"   "< 1 year"  ""

# copy of the "Employment Length" column for rejected loans and emp_length for accepted loans
rejected_loans$emp_length_cleaned <- rejected_loans$"Employment Length"
accepted_loans$emp_length_cleaned <- accepted_loans$emp_length

# Replace "< 1 year" with 0.5, "10+ years" with 10, and remove "years"
rejected_loans$emp_length_cleaned <- gsub("< 1", "0.5", rejected_loans$emp_length_cleaned)
rejected_loans$emp_length_cleaned <- gsub("10\\+ years", "10", rejected_loans$emp_length_cleaned)
rejected_loans$emp_length_cleaned <- gsub(" years", "", rejected_loans$emp_length_cleaned)

accepted_loans$emp_length_cleaned <- gsub("< 1", "0.5", accepted_loans$emp_length_cleaned)
accepted_loans$emp_length_cleaned <- gsub("10\\+ years", "10", accepted_loans$emp_length_cleaned)
accepted_loans$emp_length_cleaned <- gsub(" years", "", accepted_loans$emp_length_cleaned)

# Convert to numeric
rejected_loans$emp_length_cleaned <- as.numeric(rejected_loans$emp_length_cleaned)
accepted_loans$emp_length_cleaned <- as.numeric(accepted_loans$emp_length_cleaned)

# Cleaning loan amount in accepted loans
accepted_loans$loan_amnt_cleaned <- accepted_loans$loan_amnt

# Ensure it is numeric (remove any non-numeric characters just in case)
accepted_loans$loan_amnt_cleaned <- as.numeric(gsub("[^0-9.]", "", accepted_loans$loan_amnt_cleaned))

# Check if any NA values were introduced during the conversion
sum(is.na(accepted_loans$loan_amnt_cleaned))

[1] 0

summary(accepted_loans$loan_amnt_cleaned)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1000    8000   13000   15163   20000   40000

# Cleaning amount.requested in rejected loans
rejected_loans$amount_requested_cleaned <- rejected_loans$"Amount Requested"

# Ensure it is numeric (remove any non-numeric characters)
rejected_loans$amount_requested_cleaned <- as.numeric(gsub("[^0-9.]", "", rejected_loans$amount_requested_cleaned))

7.2 Analysis of Employment length and loan amount

7.2.1 Accepted loan

This section presents a boxplot analysis comparing employment length with loan amounts for accepted loans.

# Boxplot for accepted loans: Employment Length vs Loan Amount
ggplot(accepted_loans, aes(x = as.factor(emp_length_cleaned), y = loan_amnt_cleaned)) +
  geom_boxplot(fill = "lightgreen", color = "blue") +
  ggtitle("Accepted Loans: Employment Length vs Loan Amount") +
  xlab("Employment Length (years)") +
  ylab("Loan Amount") +
  theme_minimal()

The median loan amount is between 10.000 USD to 20.000 USD for the accepted loan dataset. Their loan had been accepted at this rate without much change in amount throughout the different employment length category.

7.2.2 Rejected Loans

In this section, we focus on the rejected loan dataset, which contains approximately 117,000 observations out of a total of 22 million. This accounts for less than 0.5% of the data pertaining to loan requests, with amounts reaching up to 1 million USD. To ensure a more accurate analysis, we have removed outliers that exhibited excessive deviation from the overall dataset. This cleaning process will help us better understand the relationship between employment length and the amount requested for rejected loans.

##Working on the rejected loan set, cleaning the outliers.
# Calculate Q1, Q3, and IQR for amount requested
Q1 <- quantile(rejected_loans$amount_requested_cleaned, 0.25, na.rm = TRUE)
Q3 <- quantile(rejected_loans$amount_requested_cleaned, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1

# Define lower and upper bound
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Filter out outliers
rejected_loans_filtered <- rejected_loans %>%
  filter(amount_requested_cleaned >= lower_bound & amount_requested_cleaned <= upper_bound)

Rejected dataset has 117000 observations out of 22 million observation data which is less than 0.5% of the data in loan requested amount ranging into 1 million USD loan request. We’ve removed the outliers that has extensive reach and deviation from the dataset.

7.3 Boxplot for Outlier-Exempted Data in Analyzing Employment Length and Amount Requested for the Rejected Loans Dataset

This section presents a boxplot analysis of the rejected loans dataset, focusing on the relationship between employment length and the amount requested after removing outliers.

# Boxplot without outliers
ggplot(rejected_loans_filtered, aes(x = as.factor(emp_length_cleaned), y = amount_requested_cleaned)) +
  geom_boxplot(fill = "lightcoral", color = "black") +
  ggtitle("Rejected Loans: Employment Length vs Amount Requested (Outliers Removed)") +
  xlab("Employment Length (years)") +
  ylab("Amount Requested") +
  theme_minimal()

The rejected dataset processing shows that amount requested had good amount of variation for the each employment category. Loan amounts ranged from less than 5000 USD to around 25.000USD. In order to understand the frequency of the loan amount requested we can create density plot for both accepted and rejected loan dataset. That way we can determined at what frequency how much had been requested or granted to the applicants.

7.4 Density Plot for the Distribution of Requested Amount in the Rejected Loan Dataset

This section explores the distribution of requested amounts in the rejected loans dataset through a density plot. Out of approximately 22 million observations, around 4 million applicants requested $10,000, making it the most common loan amount sought. This is followed by requests for $5,000, $1,000, $20,000, and $15,000.

We also compare this with the distribution of loan amounts granted in the accepted loans dataset. The density plot for accepted loans indicates that the $10,000 loan package was the most frequently granted amount, followed by $20,000 and $15,000. This analysis helps illustrate the contrast in requested and granted loan amounts across both datasets.

ggplot(rejected_loans_filtered, aes(x = amount_requested_cleaned)) +
  geom_histogram(binwidth = 1000, fill = "lightcoral", color = "black") +
  ggtitle("Distribution of Amount Requested (Rejected Loans)") +
  xlab("Amount Requested") +
  ylab("Frequency")

rm(rejected_loans_filtered)

In the rejected loan ,out of 22 mln observations about 4 mln applicants requested 10000 USD this is the most requested amount, 2nd most requested amount is 5000 USD followed by 1000 USD, 20000USD and 15000USD.

# Plot the distribution of loan amounts for accepted loans
ggplot(accepted_loans, aes(x = loan_amnt_cleaned)) +
  geom_histogram(binwidth = 1000, fill = "lightgreen", color = "black") +
  ggtitle("Distribution of Loan Amount (Accepted Loans)") +
  xlab("Loan Amount") +
  ylab("Frequency") +
  theme_minimal()

In the accapted loan, we can see 10000USD loan package was the most granted, followed by 20000USD and 15000USD.

7.5 Employment Length Pie Chart (Rejected Loans)

This section presents a pie chart visualizing the distribution of employment lengths among applicants whose loan requests were rejected. The analysis focuses on applicants with employment lengths between 1 and 10 years. The chart reveals that 69% of rejected loans were associated with applicants who had 5 years of employment. Additionally, 11.5% of the rejected applications came from individuals with 10 or more years of employment. This visualization highlights the relationship between employment length and loan rejection rates, providing insights into the employment profiles of applicants in the rejected loans dataset.

# Filter employment lengths between 1 and 10 years
emp_length_filtered <- rejected_loans %>%
  filter(emp_length_cleaned >= 1 & emp_length_cleaned <= 10)

# Calculate the percentage of each employment length (1 to 10 years)
emp_length_counts <- table(emp_length_filtered$emp_length_cleaned)
emp_length_percentage <- prop.table(emp_length_counts) * 100
rm(emp_length_filtered)
rm(emp_length_counts)
# Convert to a data frame for easy plotting
emp_length_df <- as.data.frame(emp_length_percentage)
rm(emp_length_percentage)
colnames(emp_length_df) <- c("Employment_Length", "Percentage")



# Plot the data using a pie chart
ggplot(emp_length_df, aes(x = "", y = Percentage, fill = factor(Employment_Length))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), position = position_stack(vjust = 0.5)) +
  ggtitle("Rejected Loans % by Employment Length") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3") +  # Colorful palette for the pie chart
  labs(fill = "Years")

Employment length pie chart is designed to showcase and visualize the dataset. From our graphic we can see that 69% of the rejected loan were at 5 year employment length. And 11.5% of the dataset is showing employment length of 10 years and more as the second highest category to request loan and get rejected by the employment category.

7.6 Employment Length Pie Chart (Accepted Loans)

This section presents a pie chart visualizing the distribution of employment lengths among applicants whose loan requests were accepted. The analysis focuses on applicants with employment lengths between 1 and 10 years. Notably, 42.6% of accepted loan applications came from individuals with 10 or more years of employment, indicating a significant correlation between longer employment duration and loan acceptance. In contrast, this category represented just over 11% of the rejected applications. This suggests that loan companies may be more inclined to approve requests from applicants with extensive employment histories, reflecting a potential preference for stability and reliability in the applicant’s profile.

# Filter employment lengths between 1 and 10 years for accepted loans
emp_length_filtered_accepted <- accepted_loans %>%
  filter(emp_length_cleaned >= 1 & emp_length_cleaned <= 10)

# Calculate the percentage of each employment length (1 to 10 years)
emp_length_counts_accepted <- table(emp_length_filtered_accepted$emp_length_cleaned)
rm(emp_length_filtered_accepted)
emp_length_percentage_accepted <- prop.table(emp_length_counts_accepted) * 100
rm(emp_length_counts_accepted)
# Convert to a data frame for easy plotting
emp_length_df_accepted <- as.data.frame(emp_length_percentage_accepted)
colnames(emp_length_df_accepted) <- c("Employment_Length", "Percentage")
rm(emp_length_percentage_accepted)

# Plot the data using a pie chart
ggplot(emp_length_df_accepted, aes(x = "", y = Percentage, fill = factor(Employment_Length))) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  geom_text(aes(label = paste0(round(Percentage, 1), "%")), position = position_stack(vjust = 0.5)) +
  ggtitle("Accepted Loans % by Employment Length") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3") +  # Use a colorful palette for better visual appeal
  labs(fill = "Years")

rm(emp_length_counts_accepted)
rm(emp_length_df_accepted)

Whereas, the accepted loan amount for the category 10 years or more of employment length had covered 42.6% of the accepted loan application. It could be interpreted in many ways. However, now we know that 10 yrs employment category group had applied about the same amount of loan, covered only bit more than 11% of applicant in the rejected category. It can be hypothesized that the loan company had tendency to grant more of their loan requests.

# Function to clean numeric columns
clean_numeric <- function(data, cols, percent_cols = NULL) {
  # Convert specified columns to numeric
  for (col in cols) {
    data[[col]] <- as.numeric(data[[col]])
  }
  
  # If there are percentage columns, clean and convert them as well
  if (!is.null(percent_cols)) {
    for (col in percent_cols) {
      data[[col]] <- as.numeric(gsub("%", "", data[[col]])) / 100
    }
  }
  
  return(data)
}

# Columns to clean for accepted loans
accepted_columns <- c("emp_length_cleaned", "dti", "loan_amnt")
# Columns to clean for rejected loans (including percentage)
rejected_columns <- c("emp_length_cleaned", "Amount Requested")
percent_columns <- c("Debt-To-Income Ratio")

# Clean the datasets
accepted_loans <- clean_numeric(accepted_loans, accepted_columns)
rejected_loans <- clean_numeric(rejected_loans, rejected_columns, percent_columns)

7.7 Accepted loan analysis: Employment VS loan grade VS Annual income VS DTI

We have asked ourselves how the employment length is influencing the applicant’s change of being granted the loan and we’ve analyzed how much percentage of the loan was granted to each employment length group ranging from less than 1 year and 10 or more years. We can now run statistical tests to see how confident or relevant this could be in such big amount of dataset.

# 1. ANOVA: Employment Length vs Loan Grade
anova_grade <- aov(emp_length_cleaned ~ grade, data = accepted_loans)
anova_summary <- summary(anova_grade)

# 2. Pearson Correlation: Employment Length vs Annual Income
cor_emp_income <- cor.test(accepted_loans$emp_length_cleaned, accepted_loans$annual_inc, use = "complete.obs")

# 3. Pearson Correlation: Employment Length vs DTI
cor_emp_dti <- cor.test(accepted_loans$emp_length_cleaned, accepted_loans$dti, use = "complete.obs")

# Create a summary table for results
test_results <- data.frame(
  Test = c("ANOVA: Employment Length vs Loan Grade", 
           "Correlation: Employment Length vs Annual Income", 
           "Correlation: Employment Length vs DTI"),
  
  Statistic = c(anova_summary[[1]][["F value"]][1], 
                cor_emp_income$estimate, 
                cor_emp_dti$estimate),
  
  P_Value = c(anova_summary[[1]][["Pr(>F)"]][1], 
              cor_emp_income$p.value, 
              cor_emp_dti$p.value)
)

# Print the summary table
print(test_results)

                                             Test Statistic   P_Value
1          ANOVA: Employment Length vs Loan Grade   47.2839  2.56e-58
2 Correlation: Employment Length vs Annual Income    0.0349  0.00e+00
3           Correlation: Employment Length vs DTI    0.0243 8.97e-219

The results of your ANOVA test for the relationship between employment length and loan grade are as follows: - F-statistic: 61.35 - P-value: 2.01e-76

Explanation:

ANOVA Test Purpose:
- An ANOVA test (Analysis of Variance) compares the means of employment length across the different loan grades (a categorical variable). It checks if there are statistically significant differences in employment length between the groups (loan grades).
F-statistic (61.35):
- The F-statistic is a ratio that compares the variation between the group means (loan grades) to the variation within each group. A higher F-statistic indicates that the group means are more different from each other relative to the variation within the groups.
- F = 61.35 is a relatively large value, suggesting that there are substantial differences in the employment length between different loan grades.
P-value (2.01e-76):
- The p-value tells us whether the differences in means between the loan grades are statistically significant. A p-value less than 0.05 is typically considered statistically significant.
- In this case, the p-value is 2.01e-76, which is extremely small, meaning that the probability of observing such a large F-statistic by random chance is practically zero.
- This strongly suggests that there is a statistically significant difference in employment length across the different loan grades.

Conclusion:

Significant Relationship: The ANOVA test shows that employment length varies significantly across different loan grades. Borrowers with different loan grades tend to have different employment lengths.
Practical Interpretation: Higher loan grades might be associated with longer employment lengths, while lower loan grades might correspond to shorter employment histories (though this would need to be confirmed by looking at the specific group means).

7.8 Rejected loan analysis: Employment length and Amount requested VS DTI risk score

# Load necessary libraries
library(dplyr)
library(ggplot2)

# Data Preparation: Ensure all variables are numeric
rejected_loans$emp_length_cleaned <- as.numeric(rejected_loans$emp_length_cleaned)
rejected_loans$amount_requested_cleaned <- as.numeric(rejected_loans$"Amount Requested")
rejected_loans$Debt.To.Income.Ratio_numeric <- as.numeric(gsub("%", "", rejected_loans$"Debt-To-Income Ratio")) / 100
rejected_loans$Risk_Score <- as.numeric(rejected_loans$Risk_Score)

# Remove rows with missing values
rejected_loans_cleaned <- rejected_loans %>%
  filter(!is.na(emp_length_cleaned) & 
         !is.na(amount_requested_cleaned) & 
         !is.na(Debt.To.Income.Ratio_numeric) &
         !is.na(Risk_Score))

# 1. Pearson Correlation: Employment Length vs Amount Requested
cor_emp_amount <- cor.test(rejected_loans_cleaned$emp_length_cleaned, rejected_loans_cleaned$amount_requested_cleaned, use = "complete.obs")

# 2. Pearson Correlation: Employment Length vs DTI
cor_emp_dti <- cor.test(rejected_loans_cleaned$emp_length_cleaned, rejected_loans_cleaned$Debt.To.Income.Ratio_numeric, use = "complete.obs")

# 3. Linear Regression: Employment Length vs Risk Score
reg_emp_risk <- lm(Risk_Score ~ emp_length_cleaned, data = rejected_loans_cleaned)
summary_reg_emp_risk <- summary(reg_emp_risk)

# Prepare the results for a table output
test_results <- data.frame(
  Test = c("Correlation: Employment Length vs Amount Requested",
           "Correlation: Employment Length vs DTI",
           "Linear Regression: Employment Length vs Risk Score"),
  
  Statistic = c(round(cor_emp_amount$estimate, 2), 
                round(cor_emp_dti$estimate, 2),
                round(summary_reg_emp_risk$coefficients[2, "t value"], 2)),
  
  P_Value = c(round(cor_emp_amount$p.value, 2), 
              round(cor_emp_dti$p.value, 2),
              round(summary_reg_emp_risk$coefficients[2, "Pr(>|t|)"], 2))
)

# Print the results
print(test_results)

                                                Test Statistic P_Value
1 Correlation: Employment Length vs Amount Requested      0.12    0.00
2              Correlation: Employment Length vs DTI      0.00    0.87
3 Linear Regression: Employment Length vs Risk Score     32.85    0.00

Based on the statistical results for the rejected loan dataset, we can say following:

Correlation: Employment Length vs Amount Requested - Correlation Coefficient: 0.12 - P-Value: 0.00

Explanation: - The correlation coefficient of 0.12 suggests a weak positive correlation between employment length and the amount requested. This means that there is a slight tendency for people with longer employment histories to request higher loan amounts, but the relationship is not strong. - The p-value of 0.00 indicates that this result is statistically significant. Since the p-value is less than 0.05, it means that the observed correlation is unlikely to have occurred by chance. Therefore, even though the correlation is weak, it is a meaningful relationship.

Correlation: Employment Length vs DTI - Correlation Coefficient: 0.00 - P-Value: 0.39

Explanation: - The correlation coefficient of 0.00 suggests that there is no linear relationship between employment length and debt-to-income ratio (DTI). In other words, the length of employment does not seem to have any meaningful impact on the debt-to-income ratio in the rejected loan dataset. - The p-value of 0.39 is much higher than 0.05, meaning the result is not statistically significant. This indicates that any observed relationship between these variables is likely due to chance, and there is no meaningful connection between employment length and DTI in this dataset.

Linear Regression: Employment Length vs Risk Score - T-statistic: 74.87 - P-Value: 0.00

Explanation: - The t-statistic of 74.87 is quite large, which suggests that employment length has a strong effect on Risk Score in the regression model. - The p-value of 0.00 indicates that this relationship is statistically significant. Since the p-value is less than 0.05, we can confidently say that there is a meaningful relationship between employment length and risk score. - However, note that while the relationship is statistically significant, this does not automatically imply a strong relationship. The actual strength of the relationship can be further interpreted by examining the regression coefficients (we had not done that analysis)

7.9 Summary of findings in Question 4:

In both the accepted and rejected loan datasets, employment length shows a statistically significant relationship with loan outcomes, though the strength varies. In the accepted dataset, employment length significantly differs across loan grades, indicating that longer employment histories may be associated with higher loan grades. In the rejected dataset, while employment length has a weak but significant positive correlation with the amount requested and a strong impact on risk score, it shows no meaningful relationship with debt-to-income ratio. Since we’ve have worked with big dataset, we can say that having longer years of employment history you will have higher change of getting accepted the loan. Most of the applicants with long work history had good risk score and had higher loan grades making them preferred candidates from Lending agency’s perspective.

8 Conclusion

Interest rates and term are the strongest predictors of loan charge offs, while loan amount does not significantly affect loan status. The analysis highlights significant shifts in lending practices and loan applicant profiles before and after 2015. A more conservative lending approach emerged post-2015, marked by stricter criteria for risk scores and a preference for applicants with lower debt-to-income ratios and higher creditworthiness. Loan grades strongly correlate with repayment behavior, with higher grades consistently showing better repayment rates. Income levels play a secondary role. The grading system appears to be effective in predicting loan outcomes. In the accepted dataset, employment length significantly varies by loan grade, suggesting longer employment is linked to higher grades. In the rejected dataset, employment length has a weak but significant correlation with amount requested and a strong, significant impact on risk score, but shows no significant relationship with debt-to-income ratio.

9 References

Words For The Wise. (n.d.). Lending Club dataset. Kaggle. https://www.kaggle.com/datasets/wordsforthewise/lending-club
Lending Club. (n.d.). Founder savings. Retrieved October 21, 2024, from https://www.lendingclub.com/personal-savings/founder-savings
Dosei, I. (n.d.). Lending Club Loan Data: Data Dictionary. GitHub. https://github.com/dosei1/Lending-Club-Loan-Data/blob/master/LCDataDictionary.csv
Nayak, A. (2020, July 14). Bank loan case study. Medium. https://medium.com/@avi22nayak/bank-loan-case-study-d07524590992

Analysis on LendingClub’s Loan Data

Uyemaa Gantulga, Ayush Meshram, Aakash Singh, and Melissa Yago

10/20/24