Exploratory Data Analysis (EDA): Initial exploration was conducted to understand the dataset's structure and contents.
- The dataset contains 301 rows and 9 columns.
- The columns include `Car_Name`, `Year`, `Selling_Price` (target variable), `Present_Price`, `Kms_Driven`, `Fuel_Type`, `Seller_Type`, `Transmission`, and `Owner`.
- Data types were checked, identifying numerical features (like `Year` and `Selling_Price`) and categorical features (`Fuel_Type`, `Seller_Type`, `Transmission`, `Car_Name`) stored as objects.
- A check for missing values confirmed that the dataset is complete, with no null entries in any column.
- The distribution of key categorical features was examined:
  - `Fuel_Type`: primarily Petrol, followed by Diesel, with a very small number of CNG cars.
  - `Seller_Type`: more cars listed by Dealers than by Individuals.
  - `Transmission`: predominantly Manual transmission cars.

While the notebook did not include extensive EDA visualizations (such as histograms or correlation heatmaps), these initial checks provided essential insight into the data's characteristics and confirmed its suitability for modeling.
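The checks above can be sketched as follows. The DataFrame here is a small made-up sample standing in for the full 301-row CSV, so the printed counts are illustrative only:

```python
import pandas as pd

# Illustrative subset of the dataset (values are placeholders for the sketch;
# the real notebook loads the full 301-row CSV)
df = pd.DataFrame({
    "Car_Name": ["ritz", "sx4", "ciaz"],
    "Year": [2014, 2013, 2017],
    "Selling_Price": [3.35, 4.75, 7.25],
    "Present_Price": [5.59, 9.54, 9.85],
    "Kms_Driven": [27000, 43000, 6900],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol"],
    "Seller_Type": ["Dealer", "Dealer", "Dealer"],
    "Transmission": ["Manual", "Manual", "Manual"],
    "Owner": [0, 0, 0],
})

print(df.shape)                        # (rows, columns)
print(df.dtypes)                       # numeric vs. object (categorical) columns
print(df.isnull().sum())               # missing-value check per column
print(df["Fuel_Type"].value_counts())  # category distribution
```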
Preprocessing: To prepare the data for machine learning algorithms, the following preprocessing steps were performed:
- Encoding Categorical Features: The categorical features `Fuel_Type`, `Seller_Type`, and `Transmission` were converted into numerical representations using label encoding via the `replace` method:
  - `Fuel_Type`: Petrol -> 0, Diesel -> 1, CNG -> 2
  - `Seller_Type`: Dealer -> 0, Individual -> 1
  - `Transmission`: Manual -> 0, Automatic -> 1
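The encoding step can be sketched like this, using the same mappings listed above on a small placeholder DataFrame:

```python
import pandas as pd

# Placeholder rows covering each category (not the real data)
df = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG"],
    "Seller_Type": ["Dealer", "Individual", "Dealer"],
    "Transmission": ["Manual", "Automatic", "Manual"],
})

# Map each category to the integer codes used in the notebook
df = df.replace({
    "Fuel_Type": {"Petrol": 0, "Diesel": 1, "CNG": 2},
    "Seller_Type": {"Dealer": 0, "Individual": 1},
    "Transmission": {"Manual": 0, "Automatic": 1},
})

print(df)
```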
- Feature Selection: The `Car_Name` column was dropped, as it likely contains too many unique values to be used effectively as a feature without more complex processing. The target variable, `Selling_Price`, was separated from the features.
- Feature Matrix (X) and Target Vector (Y): The final feature matrix `X` included all columns except `Car_Name` and `Selling_Price`; the target vector `Y` contained only `Selling_Price`.
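A minimal sketch of building `X` and `Y` and holding out a test set. The rows, `test_size`, and `random_state` values here are assumptions for illustration, not taken from the notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative encoded data (placeholder values)
df = pd.DataFrame({
    "Car_Name": ["ritz", "sx4", "ciaz", "wagon r"],
    "Year": [2014, 2013, 2017, 2011],
    "Selling_Price": [3.35, 4.75, 7.25, 2.85],
    "Present_Price": [5.59, 9.54, 9.85, 4.15],
    "Kms_Driven": [27000, 43000, 6900, 5200],
    "Fuel_Type": [0, 1, 0, 0],
    "Seller_Type": [0, 0, 0, 0],
    "Transmission": [0, 0, 0, 0],
    "Owner": [0, 0, 0, 0],
})

# Drop the identifier column and separate the target
X = df.drop(columns=["Car_Name", "Selling_Price"])
Y = df["Selling_Price"]

# Hold out a test set (split parameters are assumed)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```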
1. Linear Regression:
A standard Linear Regression model (`LinearRegression` from scikit-learn) was instantiated. This model aims to find the best linear relationship between the features and the target variable. It was trained on the training data using the `.fit()` method.
2. Lasso Regression:
A Lasso Regression model (`Lasso` from scikit-learn) was also instantiated and trained. Lasso (Least Absolute Shrinkage and Selection Operator) is a linear model that performs L1 regularization. This regularization adds a penalty equal to the absolute value of the coefficient magnitudes, which can drive some coefficients to exactly zero, effectively performing feature selection. It was likewise trained with the `.fit()` method.
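Instantiating and fitting both models can be sketched as follows. The training arrays are made-up placeholders; the notebook fits on the real `X_train` and `Y_train`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

# Placeholder training data (Year, Present_Price, Kms_Driven -> Selling_Price)
X_train = np.array([[2014, 5.59, 27000],
                    [2013, 9.54, 43000],
                    [2017, 9.85, 6900],
                    [2011, 4.15, 5200]], dtype=float)
Y_train = np.array([3.35, 4.75, 7.25, 2.85])

# Ordinary least squares
lin_reg = LinearRegression()
lin_reg.fit(X_train, Y_train)

# Lasso with the default regularization strength (alpha=1.0)
lasso_reg = Lasso()
lasso_reg.fit(X_train, Y_train)

print(lin_reg.coef_)
print(lasso_reg.coef_)
```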
- Linear Regression:
  - Training Data R² Score: approximately 0.880 (0.8799 in the notebook)
  - Test Data R² Score: code exists, but the output is not shown in the notebook snapshot (typically slightly lower than the training R²)
- Lasso Regression:
  - Training Data R² Score: approximately 0.843 (0.8427 in the notebook)
  - Test Data R² Score: code exists, but the output is not shown in the notebook snapshot (typically slightly lower than the training R²)
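The training-set R² values above are produced with `r2_score`, which can be sketched on placeholder data like this (the printed score here is for the toy data, not the notebook's 0.8799):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Placeholder data; the notebook scores on the real X_train / Y_train
X_train = np.array([[2014, 5.59], [2013, 9.54], [2017, 9.85], [2011, 4.15]])
Y_train = np.array([3.35, 4.75, 7.25, 2.85])

model = LinearRegression().fit(X_train, Y_train)

# R² on the training data, as reported in the notebook
train_pred = model.predict(X_train)
train_r2 = r2_score(Y_train, train_pred)
print(train_r2)
```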
Scatter plots comparing the actual selling prices (`Y_train` and `Y_test`) against the predicted prices generated by each model were created. For both models, the points clustered reasonably well around the diagonal line (y = x), visually indicating that the predictions generally align with the actual values, especially for lower-priced cars. Some divergence was observed for higher-priced vehicles, suggesting the models may be less accurate for more expensive cars.
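A sketch of such an actual-vs-predicted scatter plot. The price arrays are made up, and the `Agg` backend plus output filename are assumptions so the sketch runs non-interactively:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption for this sketch)
import matplotlib.pyplot as plt
import numpy as np

# Placeholder actual vs. predicted selling prices
Y_train = np.array([3.35, 4.75, 7.25, 2.85])
train_pred = np.array([3.5, 4.6, 7.0, 3.0])

plt.scatter(Y_train, train_pred)
lims = [Y_train.min(), Y_train.max()]
plt.plot(lims, lims)  # the y = x reference line
plt.xlabel("Actual Selling Price")
plt.ylabel("Predicted Selling Price")
plt.title("Actual vs. Predicted Prices")
plt.savefig("actual_vs_predicted.png")
```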
- Performance: The Linear Regression model achieved a higher R² score on the training data (0.880) compared to the Lasso Regression model (0.843). This suggests that, on the data it was trained on, the standard Linear Regression model captured slightly more variance.
- Regularization: Lasso Regression includes L1 regularization, which can help prevent overfitting and perform automatic feature selection by shrinking some feature coefficients to zero. While Linear Regression might achieve a higher training score, it can be more prone to overfitting if features are highly correlated or irrelevant. The slightly lower training score for Lasso might indicate a more generalized model, although the test R² scores are needed for a definitive comparison of generalization performance.
- Complexity: Both models are relatively simple linear models. Lasso adds the complexity of a regularization hyperparameter (alpha), which was used with its default value in the notebook.
Best Performing Model: Without explicit R² scores on the test set, it is difficult to definitively declare the better model for generalization. Linear Regression showed a stronger fit on the training data, so if the test scores were similar it might be slightly preferred. If Linear Regression's performance dropped significantly on the test set relative to Lasso, however, Lasso's regularization would make it the more robust choice. Given its high training R² score, Linear Regression appears to perform very well on this data, with Lasso providing a good alternative that could offer better stability if overfitting were a concern.
This project successfully demonstrated the application of machine learning to predicting used car selling prices. Data preprocessing steps, including encoding categorical variables, were crucial for preparing the data. Both Linear Regression and Lasso Regression models were trained and evaluated, showing strong performance on the training data with R² scores above 0.84; the Linear Regression model achieved the slightly higher score. Visualizations confirmed the models' predictive capabilities, particularly for lower-priced cars. While both models performed well, the test set evaluation metrics would give a clearer picture of their generalization ability. Future improvements could involve feature engineering (e.g., creating a `Car_Age` feature from `Year`), tuning the Lasso regularization parameter (alpha), trying other regression algorithms (such as Ridge, Random Forest Regressor, or Gradient Boosting), and performing more in-depth exploratory data analysis with visualizations to uncover deeper insights into feature relationships.
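As an example of the suggested feature engineering, a car-age feature could be derived from `Year` like this. The `Car_Age` column name and the reference year 2020 are assumptions, not part of the notebook:

```python
import pandas as pd

# Placeholder model years
df = pd.DataFrame({"Year": [2014, 2013, 2017]})

# Derive the car's age from its model year (reference year is assumed)
REFERENCE_YEAR = 2020
df["Car_Age"] = REFERENCE_YEAR - df["Year"]
print(df["Car_Age"].tolist())  # [6, 7, 3]
```

Age is often a more directly interpretable predictor of depreciation than the raw model year.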
- VS Code
- Jupyter Notebook
- Python 3.12+
Here's the repo containing the code and notebook: https://github.com/Ahemtan/car-price-prediction