Contributed by Jean-Francois Darre. Jean took the NYC Data Science Academy 12-week full-time Data Science Bootcamp p… from Sept 23 to Dec 18, 2015. This post is based on his second class project (due in the 4th week of the program).
Please see the app here! You can also find the code of the app here.
Introduction:
Lending Club (LC) is a peer to peer online lending platform. It is the world’s largest marketplace connecting borrowers and investors, where consumers and small business owners lower the cost of their credit and enjoy a better experience than traditional bank lending, and investors earn attractive risk-adjusted returns.
How it works:
- Customers interested in a loan complete a simple application at LendingClub.com
- LC leverages online data and technology to quickly assess risk, determine a credit rating and assign appropriate interest rates.
- Qualified applicants receive offers in just minutes and can evaluate loan options with no impact to their credit score
- Investors ranging from individuals to institutions select loans in which to invest and can earn monthly returns
- The entire process is online, using technology to lower the cost of credit and pass the savings back in the form of lower rates for borrowers and solid returns for investors.
Here is the link to more details about Lending Club.
The app:
We built two main tools to explore and run simulations on the data provided quarterly by Lending Club.
The first analysis:
For our first project, we already did some analysis on this data.
You can find the blog post here and the full R publication here!
Data exploration and visualization:
The first tool’s main focus is to let users explore the data both visually and through data frame summaries. For the visuals, we chose bubble graphs because they show four dimensions at once in a user-friendly and accessible way:
One discrete dimension (the group), represented by the bubbles themselves, and three continuous variables, mapped to the abscissa, the ordinate and the size of the bubbles.
The groups available are:
LC_Grade, the LC grades range from A to G
Home_Ownership, the home ownership status of the applicant: owner, rent, mortgage or other
Purpose, the purpose of the loan: education, small business, debt or purchase
Delinquencies_bucket, the number of delinquencies of the applicant
Inquiries_bucket, the number of inquiries the applicant has made in the past 6 months
Public_Record_bucket, the number of public records
Annual_Income_qbucket, the annual income has been bucketed in 10% quantiles
DTI_qbucket, the DTI is the Debt To Income ratio also bucketed in 10% quantiles
Revol_Util_qbucket, the utilization percentage of the revolving balance available to the applicant
Revol_Bal_qbucket, the size of the revolving balance
Total_Accounts_qbucket, the total number of accounts the applicant has ever opened
Open_Accounts_qbucket, the number of accounts still active
Credit_Age_qbucket, the length of the applicant’s credit history
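As a rough illustration of how a 10% quantile bucket such as Annual_Income_qbucket could be computed, here is a minimal sketch in base R. The incomes are made up, and the names annual_income and income_qbucket are hypothetical rather than taken from the app.

```r
# Hypothetical annual incomes; the real app works on Lending Club's loan data
annual_income <- c(28000, 35000, 42000, 50000, 58000,
                   65000, 74000, 90000, 120000, 250000)

# Decile break points: 0%, 10%, ..., 100%
breaks <- quantile(annual_income, probs = seq(0, 1, by = 0.1))

# Assign each borrower to a bucket from 1 (lowest 10%) to 10 (highest 10%);
# include.lowest = TRUE keeps the minimum income inside the first bucket
income_qbucket <- cut(annual_income, breaks = breaks,
                      include.lowest = TRUE, labels = FALSE)
```

With heavily tied values, two break points can coincide and cut() will refuse duplicated breaks, so real data may need the breaks passed through unique() first.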
The continuous variables available are:
LC_score, the sub-grades, which range from A1 to G5, converted to numbers
FICO_score, the usual FICO score ranging from 660 to 800+
Defaults, the default percentage of the selected group
Avg_Loan_Amount, the average loan amount
Loan_Amount_in_mm, the total loan amount
Term, the average term
Interest, the average interest rate
Employment_Length, the average employment length
Annual_Income, the average annual income
DTI, the average debt to income ratio
Delinquency_2yrs, the number of delinquencies in the past 2 years
Credit_Age, the average age of the credit history
Inquieries_6mths, the average number of inquiries made in the 6 months before the application
Number_of_Accounts, the average number of accounts
Public_Records, the average number of public records
Revolving_Balance, the average size of the revolving balance
Revolving_Utilized, the average utilization of the revolving balance
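The sub-grade conversion behind LC_score can be sketched as follows. The 1 to 35 scale is an assumption (the post only says the sub-grades were converted to numbers), and sub_grades and lc_score are hypothetical names.

```r
# All 35 sub-grades in order: A1, A2, ..., A5, B1, ..., G5
sub_grades <- paste0(rep(LETTERS[1:7], each = 5), 1:5)

# Hypothetical helper: position of a sub-grade in that ordered list
lc_score <- function(sub_grade) match(sub_grade, sub_grades)

lc_score(c("A1", "A2", "B1", "G5"))  # 1 2 6 35
```

Any monotonic mapping would work equally well for the bubble graphs; what matters is that a numeric scale lets the sub-grade be averaged within a group.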
Finally the user can generate these graphs on different sub-groups of Lending Club’s data:
ALL, the entire data set
Matured Only, only loans that have matured, or that would have matured by now had they not defaulted
Survived, all loans that have survived
Defaulted, all loans that have defaulted
Current, all the current loans
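One possible way to derive these sub-groups is sketched below. It assumes an issue month in YYYYMM format (like the issue_ym column used in the simulator code), a term in months, and a snapshot date for the data extract. The status labels, snapshot_ym and the add_months helper are hypothetical.

```r
# Hypothetical loans; issue_ym is YYYYMM and term is in months
loans <- data.frame(
  id       = 1:4,
  issue_ym = c(201211, 201301, 201406, 201201),
  term     = c(36, 60, 36, 36),
  status   = c("Fully Paid", "Current", "Current", "Charged Off"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: add n months to a YYYYMM value
add_months <- function(ym, n) {
  months <- (ym %/% 100) * 12 + (ym %% 100 - 1) + n
  (months %/% 12) * 100 + (months %% 12) + 1
}

snapshot_ym <- 201512  # assumed date of the data extract

# "Matured Only": the scheduled end date has passed, whether or not the loan defaulted
loans$maturity_ym <- add_months(loans$issue_ym, loans$term)
matured_only <- loans[loans$maturity_ym <= snapshot_ym, ]

defaulted <- loans[loans$status == "Charged Off", ]
current   <- loans[loans$status == "Current", ]
```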
Using these options the user can create this type of visualisation:
For the user’s convenience we added 2 visualizations:
One visualisation summarizes the total amount of loans issued for each group selected by the user, which puts things in perspective, especially when the groups differ greatly in size:
Another visualization shows the number of loans issued in each group selected by the user. This is particularly useful for detecting a size bias, i.e. whether loans in a given group tend to be bigger. For example, here we can see that the higher the income, the bigger the loans: by up to a factor of 3 between the lowest quantile and the top quantile:
Finally, we added a summary in the form of a data frame to have access to the basic statistics of each group:
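A per-group summary of this kind can be approximated with base R's aggregate(). This is only a sketch on made-up loans; the input column names (grade, loan_amnt, int_rate, defaulted) are assumptions, and the app's actual summary contains more statistics.

```r
# Hypothetical loans: defaulted is 1 for a default, 0 otherwise
loans <- data.frame(
  grade     = c("A", "A", "B", "B", "B", "C"),
  loan_amnt = c(10000, 6000, 12000, 8000, 4000, 15000),
  int_rate  = c(6.5, 7.0, 10.0, 11.0, 10.5, 14.0),
  defaulted = c(0, 0, 1, 0, 0, 1)
)

# Average every continuous variable within each group
group_summary <- aggregate(
  cbind(loan_amnt, int_rate, defaulted) ~ grade,
  data = loans, FUN = mean
)

# The mean of a 0/1 flag is the default rate; express it as a percentage
group_summary$defaulted <- 100 * group_summary$defaulted
names(group_summary) <- c("LC_Grade", "Avg_Loan_Amount", "Interest", "Defaults")
```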
Investment simulation:
The second tool we built is the investment simulator. It lets the user run investment scenarios based on LC’s historical data. The user can tune filters to select a subset of loans with specific properties that, based on their earlier analysis with the first tool, they suspect will outperform Lending Club’s rating system and hence improve the performance of their portfolio.
The user can filter/select the loans using filters on:
- The loans’ details (size, LC rating, FICO score, interest rate, term and purpose)
- The borrower’s personal information (number of inquiries in past 6 months prior to the application, annual income, DTI, employment length and home ownership status)
- The borrower’s financial details (number of delinquencies in the past 2 years at the time of the application, number of public records, the age of the credit history, revolving balance, the utilization of the revolving balance and the number of accounts)
The user can also set some parameters for the investments:
- Amount to be invested
- Maximum amount per loan
- Start date of the simulation
- Proportion of the revenues (interest + principal repayments) that should be re-invested to purchase more loans
- The interest rate the user expects to earn on their cash
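To make the last two parameters concrete, here is a small worked example of one monthly step. It mirrors the formulas in the invest function listed further down (the re_invest percentage split and the annual cash_rate converted to a monthly growth factor); the numbers themselves are made up.

```r
re_invest     <- 80    # percent of each month's payments that buys new loans
cash_rate     <- 2     # annual interest rate earned on cash, in percent
payments      <- 500   # hypothetical payments received this month
previous_cash <- 1000  # hypothetical cash balance at the end of last month

# Annual rate converted to an equivalent monthly compounding factor
monthly_growth <- (1 + cash_rate / 100)^(1 / 12)

# Share of the payments that goes back into new loans
reinvested <- re_invest / 100 * payments

# The rest joins the cash pile, which itself earned one month of interest
cash <- (1 - re_invest / 100) * payments + previous_cash * monthly_growth
```

Compounding monthly_growth twelve times gives back exactly the annual rate, which is why the exponent is 1/12 rather than a simple division by 12.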
Additional features:
- The user can also set the seed of the randomization to ensure he is testing his assumptions fairly.
- The user can save the settings of his current investment strategy and will be able to load them again in the future
- If satisfied with a strategy, the user can also submit his strategy which will be published on a public leaderboard
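The role of the seed can be illustrated with a few lines of base R. The pick_loans helper is hypothetical; it only demonstrates that fixing the seed makes the random loan selection repeatable, so two runs differ only because of the strategy settings.

```r
# Hypothetical helper: pick n_pick loans at random out of n_loans
pick_loans <- function(n_loans, n_pick, seed) {
  set.seed(seed)             # same seed, same sequence of random draws
  sample.int(n_loans, n_pick)
}

run_1 <- pick_loans(100, 5, seed = 42)
run_2 <- pick_loans(100, 5, seed = 42)

identical(run_1, run_2)  # TRUE: the same loans are picked on both runs
```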
After running a simulation, the user has access to the following outputs:
Visualization of the investment over time:
Statistics on the portfolio:
This table contains statistics on almost all features of the loans and borrowers in the portfolio. It can be transposed to show more details by checking the ‘transpose summary’ box just above the ‘invest!’ button:
Regular/compact summary:
Transposed summary:
Finally, the user has full access to the portfolio and can display all the loans or filter them in any way they want:
Our results:
We managed to improve the returns from an average of ~6% with no selection to ~12% with our best strategy! This is a huge increase and comfortably outperforms the market standard in fixed income.
These are examples of funds boasting of their superior investment performance:
As we can see, the performance achievable on Lending Club’s platform exceeds these industry averages, even on highly diversified portfolios. Indeed, one of our ~12% strategies ended with 584 loans!
For our next steps, we want to implement machine learning algorithms to build strategies automatically, while taking great care to avoid overfitting our data.
Additional details:
Here is the hall of fame:
The code is included in the app for the curious mind:
You can also find the code of the app here.
The code for the investment simulation is split over 2 functions:
One function builds and keeps track of the portfolio, and the second function is called on every cycle to simulate the purchase of loans with the available cash:
library(dplyr)  # mutate(), filter()
library(shiny)  # withProgress(), incProgress()

invest = function(my_data, to_invest, t, re_invest, max_amount, cash_rate, seed) {
  # just transforming date format, adding 2 columns to our data to keep track of
  # the investments and create a range to iterate upon
  t = as.numeric(format(t, "%Y%m"))
  my_data = mutate(my_data, invested = 0, pymnt = 0)
  # note: the `range` read on the right-hand side is not defined in this function;
  # presumably it is a vector of YYYYMM periods defined elsewhere in the app
  range = sort(unique(pmax(range, t)))
  # initializing portfolio which will store the loans we buy
  # initializing the summary of our investments, it will help us keep track of
  # our cash available on each period for potential purchases
  portfolio = c()
  summary = c()
  summary$time = t
  summary$invested = 0
  summary$received = 0
  summary$Reinvested = 0
  summary$Cash = 0
  summary$Principal = 0
  summary = as.data.frame(summary)
  i = 0
  # This is just filling the loading bar at the top of the screen when running the function
  withProgress(message = 'Running simulation…', min = 0, max = length(range) + 10, {
    # main loop!
    for (t in range) {
      i = i + 1
      incProgress(1, detail = paste0("Period: ", i))
      summary_temp = c()
      summary_temp[1] = i
      # updating the portfolio and summary
      if (!is.null(portfolio)) {
        # payments are: if this is the last payment and the loan is fully paid then you get your principal
        # (you have to consider prepayments, i.e. people paying back the loan before the scheduled end)
        # if not you get the last payment and the recoveries collected and the loan is over
        # if before the last payment then you get the portion of the installment that is owed to you
        # for the principal, we just adjust it by the portion of the installment - the interest payment
        portfolio = mutate(
          portfolio,
          pymnt = ifelse(
            last_pymnt_ym == t,
            ifelse(loan_status_new == "Fully Paid",
                   prncpl + invested / loan_amnt * recoveries,
                   invested / loan_amnt * (last_pymnt_amnt + recoveries)),
            ifelse(last_pymnt_ym > t, invested / loan_amnt * installment, 0)),
          prncpl = ifelse(last_pymnt_ym > t,
                          prncpl - (installment * invested / loan_amnt - prncpl * (rate / 1200)),
                          0)
        )
        # to_invest is incremented by the amount collected this month (keeping in mind people
        # might not want to 're_invest' everything). We also update the summary
        to_invest = to_invest + re_invest / 100 * sum(portfolio$pymnt)
        summary_temp[2] = sum(portfolio$invested)
        summary_temp[3] = sum(portfolio$pymnt)
        summary_temp[4] = to_invest
        # disabled alternative for summary_temp[4]:
        # (re_invest/100) * sum(portfolio$pymnt) +
        #   tail(summary$Reinvested, 1) * (1 + cash_rate / 100)^(1/12)
        summary_temp[5] = (1 - re_invest / 100) * sum(portfolio$pymnt) +
          tail(summary$Cash, 1) * (1 + cash_rate / 100)^(1/12)
        summary_temp[6] = sum(portfolio$prncpl)
      } else {
        # first loop here, just initializing summary_temp
        summary_temp[2] = to_invest
        summary_temp[3] = 0
        summary_temp[4] = 0
        summary_temp[5] = 0
        summary_temp[6] = 0
      }
      # update the summary with this month's summary
      summary = rbind(summary, summary_temp)
      # we filter the data to only the loans available this month, if none are available we move on
      data = filter(my_data, issue_ym == t)
      if (nrow(data) == 0) { next }
      # now we buy some additional new loans with our to_invest money
      purchase = buy(data, to_invest, max_amount, seed)
      # after buying we update our to_invest and portfolio
      to_invest = purchase$to_invest_next
      portfolio = rbind(portfolio, purchase$purchased)
    }
  })
  # that's it, now we post the results!
  summary = summary[2:nrow(summary), ]
  result = list(portfolio = portfolio,
                portfolio_short = portfolio[, c("id", "issue_d", "loan_amnt",
                                                "term", "rate", "grade", "Purpose",
                                                "invested", "loan_status_new")],
                summary = summary)
  return(result)
}
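The principal update inside invest can be checked on a single loan. The sketch below applies the same monthly rule (principal reduced by the investor's share of the installment minus the interest accrued at rate/1200) to a made-up 36-month loan. The standard annuity formula used to compute the installment is an assumption added for the example; in the app the installment comes straight from Lending Club's data.

```r
loan_amnt <- 10000  # hypothetical loan size
rate      <- 12     # annual interest rate in percent
term      <- 36     # months
invested  <- 2500   # our slice of the loan

monthly_rate <- rate / 1200
# Standard annuity installment for the whole loan (about 332.14 here)
installment <- loan_amnt * monthly_rate / (1 - (1 + monthly_rate)^(-term))

share  <- invested / loan_amnt  # fraction of every payment owed to us
prncpl <- invested              # outstanding principal on our slice

for (month in seq_len(term)) {
  pymnt    <- share * installment       # what we receive this month
  interest <- prncpl * monthly_rate     # part of the payment that is interest
  prncpl   <- prncpl - (pymnt - interest)  # the remainder repays principal
}
```

After the last scheduled payment the outstanding principal is zero up to floating point error, which confirms that the update rule amortizes the invested amount over the loan's term.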
Here is the buy function that we called in our invest function:
# function made to process the purchase of loans for each period
buy = function(data, to_invest, max_amount, seed) {
  # initializing variable to keep track of the loans we purchased and this
  # is where the 'seed' from the UI is used
  n = nrow(data)
  purchased_loans = c()
  set.seed(seed)
  # we just go through the filtered data, pick a loan at random, buy it if we
  # have enough money, otherwise leave. If we buy, reduce to_invest.
  # If we are out of loans, move on too… et voila!
  for (i in 1:n) {
    if (to_invest < max_amount) { break }
    select = ceiling(runif(1) * nrow(data))
    temp = data[select, ]
    invest = pmin(max_amount, temp$loan_amnt)
    temp$invested = invest
    temp$prncpl = invest
    to_invest = to_invest - invest
    purchased_loans = rbind(purchased_loans, temp)
    data = data[-select, ]
    if (nrow(data) == 0) { break }
  }
  # return value reconstructed from how invest() reads it (purchase$purchased and
  # purchase$to_invest_next); the original listing ends before this point
  return(list(purchased = purchased_loans, to_invest_next = to_invest))
}
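The purchasing logic can be tried in isolation with the simplified sketch below. It is not the app's buy function: it works on a plain vector of loan amounts and visits the loans in the order of a random permutation from sample.int() instead of drawing one loan at a time with runif(), but it follows the same rule of buying at most max_amount per loan until less than max_amount of cash is left. The name buy_sketch and the input values are hypothetical.

```r
buy_sketch <- function(loan_amnt, to_invest, max_amount, seed) {
  set.seed(seed)
  invested <- numeric(length(loan_amnt))  # amount put into each loan, 0 if not bought
  # visit the loans in a random order, each loan at most once
  for (k in sample.int(length(loan_amnt))) {
    if (to_invest < max_amount) break
    invested[k] <- min(max_amount, loan_amnt[k])
    to_invest <- to_invest - invested[k]
  }
  list(invested = invested, to_invest_next = to_invest)
}

result <- buy_sketch(loan_amnt = c(5000, 200, 12000, 800, 3000),
                     to_invest = 1000, max_amount = 250, seed = 1)
```

Whatever the seed, the amount invested plus the leftover cash always adds up to the starting 1000, and no single loan receives more than max_amount.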