Predicting the price of used cars

Buriihenry
5 min read · Dec 1, 2022

Today’s world is dominated by technology, and the shift towards AI and Machine Learning, connecting people, objects, and processes, is on the rise. Even basic code can help you automate systems, discover patterns, and create software and content, which makes it an asset for companies of any size. Do you like to reminisce on memories with old friends? Most likely YES! Growing up, most of us wanted to own a car someday because it felt like a rite of passage. Buying a car, especially a new one, doesn’t come cheap. Because of the high cost, the majority tend to go for the second option: “second-hand cars”. In the process of buying or selling such a car, you can easily pay too much or sell for less than its market value.

In this project we are going to predict the price of used cars using machine learning techniques. Afterwards, I will show you how to deploy the project to production using Flask. The dataset can be downloaded from Kaggle; the link is provided below.

Let’s get a bit technical with code

Importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import warnings

Loading the dataset to our DataFrame

df = pd.read_csv('Data/used_carv2.csv')
df.head()

Preprocessing

To check if there are any outliers

We conclude that we don’t have any outliers, as the values increase gradually!
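The post doesn’t show the check itself, but a quick way to eyeball outliers is to look at the quantiles from describe(): if they rise smoothly rather than jumping suddenly at the tail, extreme outliers are unlikely. A minimal sketch on a toy stand-in for the Kaggle data (the real run uses the df loaded above):

```python
import pandas as pd

# Toy stand-in for the Kaggle data; the real file is Data/used_carv2.csv
df = pd.DataFrame({'Selling_Price': [0.35, 2.5, 3.1, 4.6, 7.2, 9.5, 35.0]})

# Quantiles that increase gradually (min -> 25% -> 50% -> 75% -> max)
# suggest no wildly out-of-range values
print(df['Selling_Price'].describe())
```

A boxplot (sns.boxplot) of the same column is a common visual alternative.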

Check for any missing values in data set

To check for missing values we use the isna() function.
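Chaining isna() with sum() gives a per-column count of missing values in one line. A small sketch on a toy frame with a deliberate gap (the real check runs on the Kaggle df):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing Selling_Price to demonstrate the check
df = pd.DataFrame({'Selling_Price': [3.35, 4.75, np.nan],
                   'Fuel_Type': ['Petrol', 'Diesel', 'Petrol']})

# Count of missing values per column
print(df.isna().sum())
```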

Feature Extraction

Creating a new feature called Car_age, because it’s important to know how many years old the car is.
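Assuming the dataset carries a Year (year of manufacture) column, as the Kaggle car data does, Car_age is simply a reference year minus it. A sketch, using 2022 (when this article was written) as the reference:

```python
import pandas as pd

# Toy rows standing in for the real dataset's 'Year' column
df = pd.DataFrame({'Year': [2014, 2017, 2011]})

CURRENT_YEAR = 2022  # reference year; assumption, not from the original code
df['Car_age'] = CURRENT_YEAR - df['Year']
print(df)
```

After deriving Car_age, the original Year column is usually dropped to avoid carrying redundant information.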

Data Exploration & Visualization

plt.figure(figsize=[17,5])

plt.subplot(1,3,1)
sns.barplot(x='Seller_Type', y='Selling_Price', data=df)
plt.title('Selling Price Vs Seller Type')

plt.subplot(1,3,2)
sns.barplot(x='Transmission', y='Selling_Price', data=df)
plt.title('Selling Price Vs Transmission')

plt.subplot(1,3,3)
sns.barplot(x='Fuel_Type', y='Selling_Price', data=df)
plt.title('Selling Price Vs Fuel Type')
plt.show()

Visualizing our data. More visuals are in the notebook, with the link provided below.
  • Cars sold by dealers are more expensive than cars sold by individuals.
  • Selling Price is higher for Automatic cars.
  • Selling Price is higher for Diesel cars than for Petrol and CNG cars.

Dealing With Categorical Variables

Dropping the Car_Name column and applying the pd.get_dummies function.
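A minimal sketch of this step on a couple of toy rows (column names follow the Kaggle dataset); drop_first=True removes one dummy per category to avoid the dummy-variable trap:

```python
import pandas as pd

# Toy rows mirroring the dataset's categorical columns
df = pd.DataFrame({
    'Car_Name': ['ritz', 'sx4'],
    'Fuel_Type': ['Petrol', 'Diesel'],
    'Seller_Type': ['Dealer', 'Individual'],
    'Transmission': ['Manual', 'Automatic'],
})

# Drop the high-cardinality name column, then one-hot encode the rest
encoded = pd.get_dummies(df.drop('Car_Name', axis=1), drop_first=True)
print(encoded.columns.tolist())
```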

Checking Multicollinearity Using VIF

VIF measures the strength of the correlation between the independent variables in regression analysis. This correlation is known as multicollinearity, which can cause problems for regression models.

The Car_age and Fuel_Type_Petrol features have a high VIF.
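The notebook likely uses statsmodels’ variance_inflation_factor; the same number can be computed from first principles, since VIF for feature j is 1 / (1 − R²) where R² comes from regressing feature j on all the other features. A self-contained sketch on synthetic data where two columns are deliberately near-duplicates:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif(X: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing
    feature j on all the remaining features."""
    scores = {}
    for col in X.columns:
        others = X.drop(col, axis=1)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        scores[col] = 1.0 / (1.0 - r2) if r2 < 1.0 else np.inf
    return pd.Series(scores)

# 'b' is almost a copy of 'a', so both should show a very large VIF,
# while the independent 'c' stays near 1
rng = np.random.default_rng(0)
a = rng.normal(size=100)
X = pd.DataFrame({'a': a,
                  'b': a + rng.normal(scale=0.01, size=100),
                  'c': rng.normal(size=100)})
print(vif(X).round(1))
```

A common rule of thumb treats VIF above 5 (or 10) as a sign of problematic multicollinearity.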

Feature Selection

Feature selection simplifies models, improves speed, and prevents a series of unwanted issues that arise from having too many features.

Draw a heatmap to check correlation in our features

plt.figure(figsize=[15,7])
sns.heatmap(data_no_multicolinearity.corr(), annot=True)

After inspecting the correlations in the heatmap, we run an F-regression to obtain a p-value for each feature.

# f_regression fits a simple regression of y on each feature and
# returns the F statistic and the corresponding p-value for each one
from sklearn.feature_selection import f_regression, SelectKBest

X = data_no_multicolinearity.drop('Selling_Price', axis=1)
y = data_no_multicolinearity['Selling_Price']

p_values = f_regression(X, y)[1]
p_values.round(3)

Dropping the “Owner” feature, whose p-value is greater than 0.05, which makes it statistically insignificant.

Feature Importance

Feature importance gives you a score for each feature in your data; the higher the score, the more important or relevant the feature is to our target variable.

from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor()
model.fit(X,y)

Selecting our features and visualizing the top rows.
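The fitted model exposes the scores through feature_importances_; pairing them with the column names and sorting makes the ranking easy to read. A self-contained sketch on synthetic data where the target is driven almost entirely by one feature (column names mimic the dataset, but the numbers are made up):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data: the target depends strongly on 'Present_Price' only,
# so that feature's importance score should dominate
rng = np.random.default_rng(42)
X = pd.DataFrame({'Present_Price': rng.uniform(1, 20, 200),
                  'Car_age': rng.integers(0, 15, 200),
                  'Owner': rng.integers(0, 3, 200)})
y = 0.6 * X['Present_Price'] + rng.normal(scale=0.1, size=200)

model = ExtraTreesRegressor(random_state=42)
model.fit(X, y)

# Pair each feature with its score and sort, highest first
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

The scores always sum to 1, so each one can be read as a share of the total importance.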

Model Development

X = final_df.drop('Selling_Price', axis=1)
y = final_df['Selling_Price']

# feature scaling on training data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X[['Present_Price','Car_age']])
input_scaled = scaler.transform(X[['Present_Price','Car_age']])
scaled_data = pd.DataFrame(input_scaled, columns=['Present_Price','Car_age'])

X_scaled = scaled_data.join(X.drop(['Present_Price','Car_age'], axis=1))

Splitting our data into Train and Test

# 80% train vs 20% test
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X_scaled,y,test_size=0.2, random_state=365)

Model Building

The “Random Forest Regressor” model gives better accuracy than the other two models we tried.
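The training code isn’t reproduced in the post, so here is a minimal sketch of the comparison on synthetic data (the real run uses X_scaled and y from above); Linear Regression stands in for one of the unnamed “other two” models. Price is made to depreciate non-linearly with age, an interaction a linear model cannot fully capture:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: selling price = present price times an
# exponential depreciation factor, plus noise
rng = np.random.default_rng(365)
X = pd.DataFrame({'Present_Price': rng.uniform(1, 20, 400),
                  'Car_age': rng.integers(0, 15, 400).astype(float)})
y = X['Present_Price'] * np.exp(-0.15 * X['Car_age']) + rng.normal(scale=0.3, size=400)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=365)

scores = {}
for name, model in [('Linear Regression', LinearRegression()),
                    ('Random Forest', RandomForestRegressor(random_state=365))]:
    model.fit(x_train, y_train)
    scores[name] = r2_score(y_test, model.predict(x_test))
    print(name, round(scores[name], 3))
```

On data like this, the tree ensemble captures the price-age interaction that the linear model misses, which mirrors why Random Forest comes out ahead in the notebook.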

Conclusions

The present price of a car plays an important role in predicting the selling price; as one increases, the other gradually increases.
Manual cars are priced lower than Automatic cars.
Cars sold by individuals tend to fetch a lower selling price than cars sold by dealers.

Link to this notebook and deployment of this project using Flask can be found below:
