Loan Prediction Using Machine Learning
ML Pipeline
A loan is a sum of money that one or more individuals or companies borrow from banks or other financial institutions in order to financially manage planned or unplanned events. In doing so, the borrower incurs a debt, which must be paid back with interest within a given period of time.
Suppose we want to predict whether a borrower is eligible for a loan. Determining eligibility manually is tedious work, hence the need to automate the process. Using customer details and previous borrowing history, we can predict whether the borrower will repay the loan or not. Python libraries such as pandas, matplotlib, and seaborn help us extract the insights needed for this problem.
Problem Statement: the aim is to predict whether a loan applicant can repay the loan or not, using voting ensemble techniques that combine the predictions from multiple machine learning algorithms. Such a case is a classification problem.
You can download the dataset from the link below. First register, then you will be prompted to download the dataset.
https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/
Loan_ID : Unique Loan ID
Gender : Male/ Female
Married : Applicant married (Y/N)
Dependents : Number of dependents
Education : Applicant Education (Graduate/ Under Graduate)
Self_Employed : Self employed (Y/N)
ApplicantIncome : Applicant income
CoapplicantIncome : Coapplicant income
LoanAmount : Loan amount in thousands of dollars
Loan_Amount_Term : Term of loan in months
Credit_History : credit history meets guidelines yes or no
Property_Area : Urban/ Semi Urban/ Rural
Loan_Status : Loan approved (Y/N); this is the target variable
The list above shows the column names and their descriptions.
Importing relevant libraries
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import numpy as np
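Before exploring the data we need it in a pandas DataFrame. The sketch below is illustrative: in practice you would call pd.read_csv on the downloaded file (the filename is an assumption), but here we build a tiny frame with the same schema so the column types are visible.

```python
import pandas as pd

# In practice, load the downloaded file, e.g.:
#   train = pd.read_csv("train.csv")   # filename is an assumption
# For illustration, a tiny frame with the schema described above
# (the values are made up, not taken from the real dataset):
train = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003"],
    "Gender": ["Male", "Female"],
    "Married": ["No", "Yes"],
    "Dependents": ["0", "1"],
    "Education": ["Graduate", "Under Graduate"],
    "Self_Employed": ["No", "Yes"],
    "ApplicantIncome": [5849, 4583],
    "CoapplicantIncome": [0.0, 1508.0],
    "LoanAmount": [None, 128.0],          # in thousands of dollars
    "Loan_Amount_Term": [360.0, 360.0],   # in months
    "Credit_History": [1.0, 1.0],
    "Property_Area": ["Urban", "Rural"],
    "Loan_Status": ["Y", "N"],            # target variable
})
print(train.dtypes)
```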
Exploratory Data Analysis
Using seaborn to check the distribution of the numerical variables: the ApplicantIncome and LoanAmount columns.
sns.distplot(train.ApplicantIncome, kde=False)
sns.distplot(train.ApplicantIncome.dropna(), kde=False)
We used the dropna function because the ApplicantIncome column had missing values. (Note: in recent seaborn versions, distplot is deprecated in favour of histplot.)
People with better education should normally have a higher income; we can check that by plotting the education level against the income.
Graduates have more outliers, which means that people with very high incomes are most likely well educated.
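A boxplot is a natural way to make this comparison. The snippet below is a sketch using a toy stand-in for the real train frame (the income values are invented for illustration); with the real data you would pass the loaded DataFrame instead.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import seaborn as sns

# Toy stand-in for the real `train` frame (values are illustrative only)
train = pd.DataFrame({
    "Education": ["Graduate"] * 4 + ["Under Graduate"] * 4,
    "ApplicantIncome": [4000, 6000, 8000, 51000, 2500, 3000, 3500, 4000],
})

# Income distribution per education level; outliers show up as points
ax = sns.boxplot(x="Education", y="ApplicantIncome", data=train)
ax.set_title("Applicant income by education level")

# The same comparison numerically: median income per education level
print(train.groupby("Education")["ApplicantIncome"].median())
```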
Next, we check how Credit_History affects Loan_Status. A mean approval rate close to 1 indicates a high loan success rate.
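One way to compute that rate is to map the Y/N target to 1/0 and take the mean per Credit_History group. A minimal sketch, again on a toy stand-in for the train frame:

```python
import pandas as pd

# Illustrative stand-in for `train` (the real dataset has ~614 rows)
train = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    "Loan_Status":    ["Y", "Y", "Y", "N", "N", "Y"],
})

# Map the target to True/False and take the mean approval rate per group;
# a mean close to 1 means nearly all such applications were approved
approval = (train["Loan_Status"] == "Y").groupby(train["Credit_History"]).mean()
print(approval)
```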
Data preprocessing:
Dealing with missing values: check for them using the isnull function.
For numerical columns, a good solution is to fill missing values with the mean; for categorical columns, we can fill them with the mode.
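The mean/mode fill described above can be sketched as follows; the two columns here are a small invented sample, not the real dataset.

```python
import pandas as pd
import numpy as np

# Small invented sample with one missing value in each column
train = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 150.0, 130.0],   # numerical
    "Self_Employed": ["No", "Yes", None, "No"],    # categorical
})

# Numerical column: fill with the mean of the observed values
train["LoanAmount"] = train["LoanAmount"].fillna(train["LoanAmount"].mean())

# Categorical column: fill with the mode (most frequent value)
train["Self_Employed"] = train["Self_Employed"].fillna(train["Self_Employed"].mode()[0])

print(train.isnull().sum())  # all counts should now be zero
```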
Next, we handle and treat outliers.
An outlier is a data point that is distant from all other observations, i.e. one that lies outside the overall distribution of the dataset.
To treat outliers, we can apply a log transform to dampen their effect, or remove them permanently. However, many authors have suggested that the best way to deal with outliers is not to delete them from your dataset.
In our dataset, some people might have a low income but a strong CoapplicantIncome, so a good idea is to combine them into a TotalIncome column.
train['LoanAmount_log'] = np.log(train['LoanAmount'])
train['TotalIncome'] = train['ApplicantIncome'] + train['CoapplicantIncome']
train['TotalIncome_log'] = np.log(train['TotalIncome'])
After plotting the TotalIncome_log column, the data appears approximately normally distributed.
Modelling Using Sklearn
We need to turn all the categorical variables into numbers. We'll do that using scikit-learn's LabelEncoder.
Then we'll create a function that takes in a model, fits it, and measures the accuracy, i.e. fitting the model on the training set and measuring the error on that same set.
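The LabelEncoder step can be sketched like this, looping over the categorical columns (the three-row frame below is an invented sample):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented three-row sample with a few of the categorical columns
train = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "Loan_Status": ["Y", "N", "Y"],
})

# Encode each categorical column to integer codes in place;
# LabelEncoder assigns codes in sorted order of the unique values
cat_cols = ["Gender", "Property_Area", "Loan_Status"]
le = LabelEncoder()
for col in cat_cols:
    train[col] = le.fit_transform(train[col])

print(train)
```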
We first import Sklearn modules
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics
Then write our function
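The article does not show the function body, so the version below is a sketch of what such a function typically looks like: fit the model, report training-set accuracy, and report a 5-fold cross-validation score. The function name, its signature, and the demo data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn import metrics

def classification_model(model, data, predictors, outcome):
    """Fit `model` on data[predictors] -> data[outcome], then report
    training-set accuracy and a 5-fold cross-validation score.
    (Names are assumptions; the original function body is not shown.)"""
    model.fit(data[predictors], data[outcome])
    predictions = model.predict(data[predictors])
    accuracy = metrics.accuracy_score(data[outcome], predictions)
    cv_scores = cross_val_score(model, data[predictors], data[outcome], cv=5)
    print(f"Accuracy : {accuracy:.3f}")
    print(f"Cross-validation score : {cv_scores.mean():.3f}")
    return accuracy, cv_scores.mean()

# Tiny synthetic stand-in for the encoded training frame
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Credit_History": rng.integers(0, 2, 100),
    "TotalIncome_log": rng.normal(8.5, 0.5, 100),
})
demo["Loan_Status"] = demo["Credit_History"]  # approval tracks credit history here

acc, cv = classification_model(LogisticRegression(), demo,
                               ["Credit_History", "TotalIncome_log"], "Loan_Status")
```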
Using Logistic Regression
Using Decision Tree
Using Random Forest
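Running the three models side by side can be sketched as below. The data is a synthetic stand-in (the feature names mirror the article, the values and the noisy toy target are invented), so the exact scores will differ from the real dataset; the point is the pattern of training accuracy versus cross-validation score.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn import metrics

# Synthetic stand-in for the preprocessed, label-encoded training data
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "Credit_History": rng.integers(0, 2, n),
    "TotalIncome_log": rng.normal(8.5, 0.5, n),
    "LoanAmount_log": rng.normal(4.8, 0.3, n),
})
# Toy target: follows Credit_History, with ~15% of labels flipped as noise
flip = rng.random(n) < 0.15
y = np.where(flip, 1 - X["Credit_History"], X["Credit_History"])

results = {}
for name, model in [
    ("Logistic Regression", LogisticRegression()),
    ("Decision Tree", DecisionTreeClassifier(random_state=1)),
    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=1)),
]:
    model.fit(X, y)
    train_acc = metrics.accuracy_score(y, model.predict(X))
    cv = cross_val_score(model, X, y, cv=5).mean()
    results[name] = (train_acc, cv)
    print(f"{name}: train accuracy {train_acc:.3f}, CV score {cv:.3f}")
```

On data like this, the unpruned decision tree memorizes the label noise (near-perfect training accuracy, noticeably lower CV score), which is exactly the gap between accuracy and cross-validation discussed next.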
The models give us a good accuracy score on the training set but a low cross-validation score; this is a classic example of overfitting.
Solutions: reduce the number of predictors or tune the model parameters.
Conclusion
From the Exploratory Data Analysis (EDA) we could generate insights from the data, such as how each feature relates to the target. The evaluation of the three models shows that Random Forest performed better than Logistic Regression and Decision Tree.