Predicting the price of used cars
Today’s world is dominated by technology, and the shift towards AI, machine learning, and connecting people, objects and processes is on the rise. Basic code can help you automate systems, discover patterns, and create software and content, all of which can be an asset for companies of any size.
Do you like to reminisce about memories with old friends? Most likely YES! Growing up, we all wanted to own a car someday because it felt like a rite of passage. Buying a car, especially a new one, doesn’t come cheap. Because of the high cost, the majority tend to go for the second choice: second-hand cars. In the process of buying or selling such a car, you can either pay too much or sell for less than its market value.
In this project we are going to predict the price of used cars using machine learning techniques. Afterwards, I will show you how to deploy this project to production using Flask. The dataset can be downloaded from Kaggle (link below).
Let’s get a bit technical with code
Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import warnings
Loading the dataset to our DataFrame
df = pd.read_csv('Data/used_carv2.csv')
df.head()
Preprocessing
The values increase gradually with no sudden jumps, so we conclude that the data has no outliers.
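The check behind this conclusion isn't shown here. One common way to do it is to print the upper percentiles with `describe()` and look for a sudden jump between the 99th percentile and the maximum. This is a minimal sketch on a made-up frame; the values and the chosen percentiles are assumptions, not the real data.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "Selling_Price": [0.5, 1.2, 2.9, 3.5, 4.6, 5.3, 6.0, 7.5, 9.3, 11.0],
    "Present_Price": [0.8, 2.0, 4.2, 5.6, 6.9, 7.9, 8.9, 10.8, 13.6, 15.0],
})

# Summary statistics with extra upper percentiles.
summary = toy.describe(percentiles=[0.75, 0.90, 0.95, 0.99])
print(summary)

# Ratio of the maximum to the 99th percentile for each column.
# A ratio far above 1 would hint at extreme values worth inspecting.
jump = summary.loc["max"] / summary.loc["99%"]
print(jump)
```

If the ratio stays close to 1 for every column, the values rise smoothly towards the maximum, which is the pattern described above.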
Check for any missing values in data set
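A quick way to do this in pandas is `isnull().sum()`, which counts the missing entries per column. The frame below is a made-up example, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column (invented for illustration).
toy = pd.DataFrame({
    "Selling_Price": [3.35, 4.75, np.nan, 2.85],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", None],
})

# Number of missing values in each column.
missing = toy.isnull().sum()
print(missing)
```

On the real data you would call the same method on `df`.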
Feature Extraction
Creating a new feature called Car_age, because it’s important to know how many years old the car is.
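A minimal sketch of how Car_age could be derived, assuming the dataset has a `Year` column holding the model year. Both the `Year` column name and the reference year below are assumptions; pick the year your data was collected.

```python
import pandas as pd

# 'Year' is an assumed column name; the values are invented.
toy = pd.DataFrame({"Year": [2014, 2017, 2011]})

REFERENCE_YEAR = 2020  # assumption: replace with the year the data was collected

# Age in years, then drop the raw year since Car_age carries the same information.
toy["Car_age"] = REFERENCE_YEAR - toy["Year"]
toy = toy.drop(columns="Year")
print(toy)
```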
Data Exploration & Visualization
plt.figure(figsize=[17,5])
plt.subplot(1,3,1)
sns.barplot(x = 'Seller_Type',
y = 'Selling_Price',
data = df)
plt.title('Selling Price Vs Seller Type')
plt.subplot(1,3,2)
sns.barplot(x = 'Transmission',
y = 'Selling_Price',
data = df)
plt.title('Selling Price Vs Transmission')
plt.subplot(1,3,3)
sns.barplot(x = 'Fuel_Type',
y = 'Selling_Price',
data = df)
plt.title('Selling Price Vs Fuel Type')
- Cars sold by dealers are, on average, more expensive than cars sold by individuals.
- Automatic cars have a higher average Selling Price than manual cars.
- Diesel cars have a higher average Selling Price than Petrol and CNG cars.
Dealing With Categorical Variables
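The encoding step isn't shown here. One common approach is one-hot encoding with `pd.get_dummies`, dropping the first level of each category to avoid redundant columns. This is a sketch on a made-up frame; whether the original notebook used exactly this call is an assumption.

```python
import pandas as pd

# Toy frame using column names from the plots above; values are invented.
toy = pd.DataFrame({
    "Selling_Price": [3.35, 4.75, 7.25, 2.85],
    "Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"],
    "Transmission": ["Manual", "Manual", "Automatic", "Manual"],
})

# One-hot encode the categorical columns.
# drop_first=True removes one level per column (here CNG and Automatic),
# which then act as the baseline categories.
encoded = pd.get_dummies(
    toy, columns=["Fuel_Type", "Transmission"], drop_first=True, dtype=int
)
print(encoded)
```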
Checking Multicollinearity Using VIF
VIF (variance inflation factor) measures how strongly each independent variable is correlated with the other independent variables in a regression. This correlation is known as multicollinearity, and it can cause problems for regression models, such as unstable coefficient estimates.
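To make the definition concrete, here is a minimal sketch that computes VIF from scratch with NumPy: each feature is regressed on the others, and VIF = 1 / (1 - R²). The helper `vif_table` and the synthetic columns are hypothetical; statsmodels also ships a `variance_inflation_factor` function you could use instead.

```python
import numpy as np
import pandas as pd

def vif_table(features: pd.DataFrame) -> pd.Series:
    """Return VIF_j = 1 / (1 - R_j^2) for each column (hypothetical helper)."""
    X = features.to_numpy(dtype=float)
    vifs = {}
    for j, name in enumerate(features.columns):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress feature j on the remaining features plus an intercept.
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs[name] = 1.0 / (1.0 - r_squared)
    return pd.Series(vifs, name="VIF")

# Synthetic data: 'a_copy' is almost a duplicate of 'a', 'b' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": b, "a_copy": a + 0.05 * rng.normal(size=200)})

vif = vif_table(toy)
print(vif)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity, in which case one of the correlated features is usually dropped.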
Feature Selection
Feature selection simplifies models, improves speed, and prevents a range of unwanted issues that arise from having too many features.
Draw a heatmap to check the correlation between our features
plt.figure(figsize=[15,7])
sns.heatmap(data_no_multicolinearity.corr(), annot=True)
After checking the correlations in the heatmap, we run an F-regression to get a p-value for each feature
# f_regression fits a simple linear regression of the target on each feature separately and returns the F statistic and p-value for each
from sklearn.feature_selection import f_regression,SelectKBest
X = data_no_multicolinearity.drop('Selling_Price',axis=1)
y = data_no_multicolinearity['Selling_Price']
f_regression(X,y)
p_values = f_regression(X,y)[1]
p_values.round(3)
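`f_regression` returns bare arrays, so it's easy to lose track of which p-value belongs to which feature. Here is a small sketch of labelling them, using synthetic data. The column names `informative` and `noise` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

# Synthetic data: the target depends on 'informative' only.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "informative": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
toy_y = 3.0 * toy_X["informative"] + rng.normal(scale=0.5, size=200)

# f_regression returns (F statistics, p-values), one entry per column.
f_stats, p_vals = f_regression(toy_X, toy_y)

# Attach feature names and sort so the strongest features come first.
summary = pd.DataFrame(
    {"F": f_stats, "p_value": p_vals}, index=toy_X.columns
).sort_values("p_value")
print(summary.round(3))
```

Features with a small p-value (0.05 is a conventional threshold) have a statistically significant linear relationship with the target when considered on their own.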
Feature Importance
Feature importance gives a score to each feature of your data; the higher the score, the more important or relevant the feature is to the target variable.
from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor()
model.fit(X,y)
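Once the model is fitted, the scores live in its `feature_importances_` attribute. A minimal sketch of reading them as a labelled, sorted Series follows; the data is synthetic and the column names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data: the target depends on 'informative' only.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "informative": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
toy_y = 3.0 * toy_X["informative"] + rng.normal(scale=0.5, size=200)

toy_model = ExtraTreesRegressor(n_estimators=100, random_state=0)
toy_model.fit(toy_X, toy_y)

# Importances are normalized to sum to 1; label and sort them for readability.
importances = pd.Series(
    toy_model.feature_importances_, index=toy_X.columns
).sort_values(ascending=False)
print(importances)
```

On the real data, `importances.plot(kind='barh')` gives a quick visual ranking.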
Model Development
X = final_df.drop('Selling_Price', axis=1)
y = final_df['Selling_Price']
# Feature scaling (note: the scaler is fit on the full dataset here, before the split)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X[['Present_Price','Car_age']])
input_scaled = scaler.transform(X[['Present_Price','Car_age']])
# Keep X's index so the join below aligns row by row
scaled_data = pd.DataFrame(input_scaled, columns=['Present_Price','Car_age'], index=X.index)
X_scaled = scaled_data.join(X.drop(['Present_Price','Car_age'],axis=1))
Splitting our data into Train and Test
# 80% train, 20% test
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X_scaled,y,test_size=0.2, random_state=365)
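Because the scaler above is fit before the split, the test rows influence the scaling statistics. A stricter variant splits first and fits the scaler on the training rows only. This is a sketch on synthetic data; the numbers are invented, and the variable names are chosen to avoid clashing with the ones above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (values are invented).
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "Present_Price": rng.uniform(1, 20, size=100),
    "Car_age": rng.integers(1, 15, size=100).astype(float),
    "Transmission_Manual": rng.integers(0, 2, size=100),
})
toy_y = 0.6 * toy_X["Present_Price"] - 0.2 * toy_X["Car_age"]

# Split first...
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=365
)

# ...then learn the scaling statistics from the training rows only.
num_cols = ["Present_Price", "Car_age"]
toy_scaler = StandardScaler().fit(X_tr[num_cols])

X_tr = X_tr.copy()
X_te = X_te.copy()
X_tr[num_cols] = toy_scaler.transform(X_tr[num_cols])
X_te[num_cols] = toy_scaler.transform(X_te[num_cols])
```

The same fitted scaler is what you would save and reuse at prediction time in the Flask app, so new inputs are scaled the same way as the training data.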
Model Building
Of the three models compared, the Random Forest Regressor gave the best results.
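The comparison code isn't shown here, and the other two models aren't named. As a rough sketch of how such a comparison can be run, the block below scores three regressors on the same split using R² and RMSE. `LinearRegression` and `DecisionTreeRegressor` are stand-ins I picked for illustration, and the data is synthetic, so the ranking it prints says nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (invented for illustration).
rng = np.random.default_rng(0)
features = rng.uniform(0, 10, size=(300, 3))
target = (
    2.0 * features[:, 0]
    + np.sin(features[:, 1])
    + rng.normal(scale=0.3, size=300)
)

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=365
)

# Stand-in candidates; the original post does not name the other two models.
candidates = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
results = {}
for name, estimator in candidates.items():
    estimator.fit(f_train, t_train)
    predictions = estimator.predict(f_test)
    results[name] = {
        "r2": r2_score(t_test, predictions),
        "rmse": float(np.sqrt(mean_squared_error(t_test, predictions))),
    }

for name, scores in results.items():
    print(f"{name}: R2={scores['r2']:.3f}, RMSE={scores['rmse']:.3f}")
```

Scoring every model on the same held-out split keeps the comparison fair. For regression, R² and RMSE are the usual measures rather than accuracy.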
Conclusions
The present price of a car plays an important role in predicting its selling price: as one increases, so does the other.
Manual cars are priced lower than automatic cars.
Cars sold by individuals tend to fetch a lower selling price than cars sold by dealers.
A link to this notebook and to the Flask deployment of this project can be found below: