Loan Prediction Using Machine Learning

Buriihenry
5 min read · Mar 6, 2021

ML Pipeline

A loan is a sum of money that one or more individuals or companies borrow from banks or other financial institutions in order to manage planned or unplanned events. In doing so, the borrower incurs a debt, which must be paid back with interest within a given period of time.

Suppose we want to predict whether a borrower is eligible for a loan. Determining eligibility manually is tedious work, hence the need to automate the process. Using customer details and previous borrowing history, we can predict whether the borrower will repay the loan or not. Python libraries such as pandas, matplotlib, and seaborn help us extract the insights we need for this problem.

Problem Statement: the aim is to predict whether a loan applicant can repay the loan, using a voting-ensemble technique that combines the predictions of multiple machine learning algorithms. Such a case is a classification problem.

You can download the dataset from the link below. Register first and you will be prompted to download the dataset.

https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/

Loan_ID : Unique Loan ID
Gender : Male/ Female
Married : Applicant married (Y/N)
Dependents : Number of dependents
Education : Applicant Education (Graduate/ Under Graduate)
Self_Employed : Self employed (Y/N)
ApplicantIncome : Applicant income
CoapplicantIncome : Coapplicant income
LoanAmount : Loan amount in thousands of dollars
Loan_Amount_Term : Term of loan in months
Credit_History : credit history meets guidelines yes or no
Property_Area : Urban/ Semi Urban/ Rural
Loan_Status : Loan approved (Y/N) this is the target variable
This table shows the column names and their descriptions

Importing relevant libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Loading our data and checking the first 5 rows with head()
Using describe() we can see a summary of our DataFrame's numerical columns
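Since the original post showed this step as a screenshot, here is a minimal sketch. The toy DataFrame below only mimics a few of the dataset's columns so the snippet runs on its own; in practice you would load the downloaded file with pd.read_csv:

```python
import pandas as pd

# In practice: train = pd.read_csv("train.csv")  # the downloaded competition file
# Toy stand-in with a few of the dataset's columns, for illustration only
train = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001005"],
    "ApplicantIncome": [5849, 4583, 3000],
    "LoanAmount": [None, 128.0, 66.0],
    "Loan_Status": ["Y", "N", "Y"],
})

print(train.head())      # first rows of the DataFrame
print(train.describe())  # count/mean/std/min/quartiles/max for the numeric columns
```

Note that describe() only summarizes numeric columns by default, and its count row already hints at missing values (here LoanAmount has a count of 2 out of 3 rows).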

Exploratory Data Analysis

Using seaborn to check the distribution of the numerical variables: the applicant income and loan amount columns

sns.distplot(train.ApplicantIncome,kde=False)

Plotting the ApplicantIncome column: a few outliers are noted

sns.distplot(train.ApplicantIncome.dropna(), kde=False)

We use the dropna function because the ApplicantIncome column had missing values

People with better education should normally have a higher income; we can check that by plotting the education level against the income.
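The plot from the original screenshot can be reproduced along these lines. The handful of rows below is a toy stand-in for the real DataFrame, invented here purely so the snippet is self-contained:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the figure renders headlessly
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy rows standing in for the train DataFrame (the real file has ~600 rows)
train = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Graduate", "Not Graduate", "Not Graduate"],
    "ApplicantIncome": [5849, 12841, 39999, 3000, 2583],
})

# One box per education level; high-income outliers cluster in the Graduate box
ax = sns.boxplot(x="Education", y="ApplicantIncome", data=train)
plt.savefig("income_by_education.png")
```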

Graduates show more outliers, which suggests that the people with very high incomes are most likely well educated.

Next, we check how Credit_History affects Loan_Status. A mean loan-status value close to 1 for a group indicates a high approval rate.
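One way to see this, sketched with toy rows: map the Y/N target to 1/0 so that a group mean becomes an approval rate.

```python
import pandas as pd

# Toy slice of the data (invented values), just to show the mechanics
train = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
    "Loan_Status":    ["Y", "Y", "N", "N", "N", "Y"],
})

# Y -> 1, N -> 0, so the mean per group is the share of approved loans
train["Loan_Status_num"] = train["Loan_Status"].map({"Y": 1, "N": 0})
approval_rate = train.groupby("Credit_History")["Loan_Status_num"].mean()
print(approval_rate)  # applicants whose credit history meets guidelines fare far better
```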

Data preprocessing:

Dealing with missing values: check for them using the isnull function

For numerical columns a good solution is to fill missing values with the mean; for categorical columns we can fill them with the mode.
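A sketch of that imputation on a toy frame (columns assumed from the data dictionary above):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "LoanAmount": [128.0, np.nan, 66.0, 120.0],
    "Gender": ["Male", "Female", np.nan, "Male"],
})

# Numerical column: fill gaps with the column mean
train["LoanAmount"] = train["LoanAmount"].fillna(train["LoanAmount"].mean())
# Categorical column: fill gaps with the mode (most frequent value)
train["Gender"] = train["Gender"].fillna(train["Gender"].mode()[0])

print(train.isnull().sum())  # all zeros after imputation
```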

Next is to handle and treat Outliers

An outlier is a data point that is distant from all other observations, i.e. one that lies outside the overall distribution of the dataset.

To treat outliers we can either remove them permanently or apply a log transform to dampen their effect. Since many authors have suggested that the best way to deal with outliers is not to delete them from the dataset, we use the log transform here.

In our dataset some people might have a low ApplicantIncome but a strong CoapplicantIncome, so a good idea is to combine them in a TotalIncome column.

train['LoanAmount_log'] = np.log(train['LoanAmount'])
train['TotalIncome'] = train['ApplicantIncome'] + train['CoapplicantIncome']
train['TotalIncome_log'] = np.log(train['TotalIncome'])

After plotting the TotalIncome_log column, our data appears close to a normal distribution.
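A quick way to see the effect without plotting is to compare skewness before and after the transform. The income values below are invented for illustration; they just need one large outlier:

```python
import numpy as np
import pandas as pd

# Toy right-skewed incomes standing in for the real columns
train = pd.DataFrame({
    "ApplicantIncome":   [2000, 2500, 3000, 3200, 4000, 15000],
    "CoapplicantIncome": [0, 1500, 0, 1200, 0, 5000],
})
train["TotalIncome"] = train["ApplicantIncome"] + train["CoapplicantIncome"]
train["TotalIncome_log"] = np.log(train["TotalIncome"])

# The log transform shrinks the gap between the bulk of the data and the outlier,
# so the skewness of the transformed column is smaller
print(train["TotalIncome"].skew(), train["TotalIncome_log"].skew())
```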

Modelling Using Sklearn

We need to turn all the categorical variables into numbers; we'll do that using scikit-learn's LabelEncoder.
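A sketch of the encoding step on a toy frame (the loop over column names mirrors what you would do over all categorical columns of the real data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "Loan_Status": ["Y", "N", "Y"],
})

# Replace each categorical column with integer codes (alphabetical class order)
for col in ["Gender", "Property_Area", "Loan_Status"]:
    train[col] = LabelEncoder().fit_transform(train[col])

print(train.dtypes)  # every column is now numeric and ready for sklearn models
```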

We'll create a function that takes in a model, fits it, and measures the accuracy, meaning it uses the model on the training set and measures the error on that same set. It also reports a cross-validation score.

We first import Sklearn modules

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

Then we write our function.
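The function itself appeared as a screenshot in the original post. The reconstruction below follows the description above; the function name, the 5-fold split, and the synthetic demo data are all assumptions made so the sketch is runnable:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn import metrics

def classification_model(model, data, predictors, outcome):
    """Fit the model, then report training accuracy and 5-fold CV accuracy."""
    model.fit(data[predictors], data[outcome])
    predictions = model.predict(data[predictors])
    accuracy = metrics.accuracy_score(data[outcome], predictions)
    print(f"Accuracy: {accuracy:.3%}")
    cv = cross_val_score(model, data[predictors], data[outcome],
                         cv=KFold(n_splits=5, shuffle=True, random_state=1))
    print(f"Cross-validation score: {cv.mean():.3%}")
    return accuracy, cv.mean()

# Tiny synthetic demo so the function can be exercised without the real file
rng = np.random.RandomState(0)
demo = pd.DataFrame({"x1": rng.randn(100), "x2": rng.randn(100)})
demo["y"] = (demo["x1"] + demo["x2"] > 0).astype(int)
acc, cv_mean = classification_model(LogisticRegression(), demo, ["x1", "x2"], "y")
```

With the real data you would call it as, e.g., classification_model(model, train, ['Credit_History', 'TotalIncome_log', 'LoanAmount_log'], 'Loan_Status') once the columns are encoded.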

Using Logistic Regression

Using Decision Tree

Using Random Forest
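The three model runs above were shown as screenshots in the original post. The self-contained sketch below compares the same three classifiers on synthetic data (invented here, since the competition file can't be bundled), reporting both training and cross-validation accuracy:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded training data; the real run would use
# columns such as Credit_History, LoanAmount_log and TotalIncome_log
rng = np.random.RandomState(42)
X = pd.DataFrame({"f1": rng.randn(200), "f2": rng.randn(200), "f3": rng.randn(200)})
y = (X["f1"] + 0.5 * X["f2"] + 0.3 * rng.randn(200) > 0).astype(int)

results = {}
for name, model in [("Logistic Regression", LogisticRegression()),
                    ("Decision Tree", DecisionTreeClassifier(random_state=1)),
                    ("Random Forest", RandomForestClassifier(random_state=1))]:
    train_acc = model.fit(X, y).score(X, y)          # accuracy on the same data
    cv_acc = cross_val_score(model, X, y, cv=5).mean()  # held-out estimate
    results[name] = (train_acc, cv_acc)
    print(f"{name}: train accuracy {train_acc:.3f}, CV accuracy {cv_acc:.3f}")
```

A near-perfect training accuracy paired with a clearly lower cross-validation score, as the unpruned decision tree typically shows, is exactly the overfitting signature described next.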

The models give us a good training accuracy but a lower cross-validation score; this is a good example of overfitting.

Solutions: reduce the number of predictors or tune the model parameters.

Conclusion

From the exploratory data analysis (EDA), we generated insights from the data and saw how each feature relates to the target. The evaluation of the three models also shows that Random Forest performed better than Logistic Regression and Decision Tree.
