
Car Price Prediction Analysis

Introduction

This report details the process of developing and evaluating machine learning models to predict the selling price of used cars. Using a dataset containing various features of cars, such as year, present price, kilometers driven, and fuel type, we performed data preprocessing, exploratory data analysis, model training, and evaluation. Two regression models, Linear Regression and Lasso Regression, were implemented and compared to determine the most effective approach for this prediction task.

Data Collection and Processing

**Data Loading:** The analysis began by loading the dataset from a CSV file named `data.csv` into a pandas DataFrame. This DataFrame served as the primary structure for data manipulation and analysis.

**Exploratory Data Analysis (EDA):** Initial exploration was conducted to understand the dataset's structure and contents.

  • The dataset contains 301 rows and 9 columns.
  • The columns include Car_Name, Year, Selling_Price (target variable), Present_Price, Kms_Driven, Fuel_Type, Seller_Type, Transmission, and Owner.
  • Data types were checked, identifying numerical features (like Year, Selling_Price) and categorical features (Fuel_Type, Seller_Type, Transmission, Car_Name) stored as objects.
  • A check for missing values confirmed that the dataset is complete, with no null entries in any column.
  • The distribution of key categorical features was examined:
    • Fuel_Type: Primarily Petrol, followed by Diesel, with a very small number of CNG cars.
    • Seller_Type: More cars listed by Dealers than by Individuals.
    • Transmission: Predominantly Manual transmission cars.

While the notebook did not include extensive EDA visualizations (such as histograms or correlation heatmaps), these initial checks provided essential insights into the data's characteristics and confirmed its suitability for modeling.
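The loading and initial checks described above can be sketched as follows. Since the actual `data.csv` is not reproduced here, the rows below are a small invented sample that only mirrors the column layout described in this report:

```python
from io import StringIO

import pandas as pd

# A tiny synthetic stand-in for data.csv (column names from the report;
# the rows themselves are invented for illustration).
csv_text = """Car_Name,Year,Selling_Price,Present_Price,Kms_Driven,Fuel_Type,Seller_Type,Transmission,Owner
ritz,2014,3.35,5.59,27000,Petrol,Dealer,Manual,0
sx4,2013,4.75,9.54,43000,Diesel,Dealer,Manual,0
wagon r,2011,2.85,4.15,5200,CNG,Individual,Manual,0
ciaz,2017,7.25,9.85,6900,Petrol,Dealer,Automatic,0
"""
car_df = pd.read_csv(StringIO(csv_text))

# The checks performed in the notebook
print(car_df.shape)                        # number of rows and columns
print(car_df.dtypes)                       # numeric vs. object (categorical) columns
print(car_df.isnull().sum())               # missing values per column
print(car_df["Fuel_Type"].value_counts())  # distribution of a categorical feature
```

On the real dataset, `car_df.shape` would report `(301, 9)` and `isnull().sum()` would show zero missing entries in every column.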

**Preprocessing:** To prepare the data for machine learning algorithms, the following preprocessing steps were performed:

  • Encoding Categorical Features: The categorical features Fuel_Type, Seller_Type, and Transmission were converted into numerical representations by mapping each category to an integer with the pandas `replace` method.
    • Fuel_Type: Petrol -> 0, Diesel -> 1, CNG -> 2
    • Seller_Type: Dealer -> 0, Individual -> 1
    • Transmission: Manual -> 0, Automatic -> 1
  • Feature Selection: The Car_Name column was dropped as it likely contains too many unique values to be effectively used as a feature without more complex processing. The target variable, Selling_Price, was separated from the features.
  • Feature Matrix (X) and Target Vector (Y): The final feature matrix X included all columns except Car_Name and Selling_Price. The target vector Y contained only the Selling_Price.
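The encoding and feature/target separation above can be sketched like this; the two-row frame is an invented sample standing in for the full 301-row dataset:

```python
import pandas as pd

# Illustrative sample; in the notebook this is the full dataset.
car_df = pd.DataFrame({
    "Car_Name": ["ritz", "sx4"],
    "Year": [2014, 2013],
    "Selling_Price": [3.35, 4.75],
    "Present_Price": [5.59, 9.54],
    "Kms_Driven": [27000, 43000],
    "Fuel_Type": ["Petrol", "Diesel"],
    "Seller_Type": ["Dealer", "Individual"],
    "Transmission": ["Manual", "Automatic"],
    "Owner": [0, 0],
})

# Map each category to the integer codes listed above.
car_df = car_df.replace({
    "Fuel_Type": {"Petrol": 0, "Diesel": 1, "CNG": 2},
    "Seller_Type": {"Dealer": 0, "Individual": 1},
    "Transmission": {"Manual": 0, "Automatic": 1},
})

# Drop Car_Name and separate the target from the features.
X = car_df.drop(columns=["Car_Name", "Selling_Price"])
Y = car_df["Selling_Price"]
```

One design note: these integer codes impose an arbitrary ordering (e.g. CNG > Diesel > Petrol); one-hot encoding via `pd.get_dummies` is a common alternative that avoids this for linear models.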

Data Splitting

The dataset was divided into training and testing sets using the `train_test_split` function from scikit-learn. 90% of the data was allocated for training the models, and the remaining 10% was reserved for evaluating their performance on unseen data. A `random_state` of 2 was used to ensure reproducibility of the split.
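The split described above looks like this; a random stand-in matrix is used here in place of the real features, with only the shapes mattering for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in feature matrix and target (7 features, as in the report;
# 100 rows rather than the real 301 to keep the arithmetic obvious).
rng = np.random.default_rng(0)
X = rng.random((100, 7))
Y = rng.random(100)

# 90/10 split with a fixed random_state for reproducibility.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.1, random_state=2
)
print(X_train.shape, X_test.shape)  # (90, 7) (10, 7)
```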

Model Training

Two different regression models were trained on the prepared training data (`X_train`, `Y_train`).

1. Linear Regression: A standard Linear Regression model (`LinearRegression` from scikit-learn) was instantiated. This model aims to find the best linear relationship between the features and the target variable. The model was trained using the `.fit()` method on the training data.

2. Lasso Regression: A Lasso Regression model (`Lasso` from scikit-learn) was also instantiated and trained. Lasso (Least Absolute Shrinkage and Selection Operator) is a linear model that performs L1 regularization. This regularization adds a penalty equal to the absolute value of the magnitude of coefficients, which can drive some coefficients to exactly zero, effectively performing feature selection. It was also trained using the `.fit()` method.
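The two training steps can be sketched as follows, using synthetic data in place of the real `(X_train, Y_train)`:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic training data standing in for (X_train, Y_train):
# a noisy linear signal over 7 features.
rng = np.random.default_rng(2)
X_train = rng.random((90, 7))
Y_train = X_train @ rng.random(7) + rng.normal(scale=0.1, size=90)

# Ordinary least squares.
lin_reg = LinearRegression()
lin_reg.fit(X_train, Y_train)

# L1-regularized regression with the default alpha (1.0), as in the notebook.
lasso_reg = Lasso()
lasso_reg.fit(X_train, Y_train)
```

With the default `alpha=1.0`, Lasso's penalty is fairly strong, so some of `lasso_reg.coef_` may be shrunk to exactly zero even when the corresponding feature carries signal.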

Model Evaluation

The performance of both models was evaluated using the R-squared (R²) metric, also known as the coefficient of determination. R² measures the proportion of the variance in the dependent variable (Selling Price) that is predictable from the independent variables (features). A score closer to 1 indicates a better fit. Both training and testing data predictions were evaluated.
  • Linear Regression:
    • Training Data R² Score: Approximately 0.880 (calculated as 0.8799 in the notebook)
    • Test Data R² Score: (code exists, but the output is not shown in the notebook snapshot; typically slightly lower than the training R²)
  • Lasso Regression:
    • Training Data R² Score: Approximately 0.843 (calculated as 0.8427 in the notebook)
    • Test Data R² Score: (code exists, but the output is not shown in the notebook snapshot; typically slightly lower than the training R²)
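The R² computation itself is a one-liner with scikit-learn's `r2_score`; the actual and predicted prices below are invented purely to show the mechanics:

```python
from sklearn.metrics import r2_score

# Hypothetical actual vs. predicted selling prices (in lakhs).
y_true = [3.35, 4.75, 7.25, 2.85]
y_pred = [3.10, 5.00, 7.00, 3.00]

# Proportion of variance in y_true explained by the predictions;
# a value close to 1 indicates a good fit.
score = r2_score(y_true, y_pred)
print(round(score, 3))
```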

Scatter plots comparing the actual selling prices (Y_train and Y_test) against the predicted prices generated by each model were created. For both models, the plots showed points clustering reasonably well around the diagonal line (y=x), visually indicating that the predictions generally align with the actual values, especially for lower-priced cars. Some divergence was observed for higher-priced vehicles, suggesting the models might be less accurate for more expensive cars.
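A sketch of the actual-vs-predicted scatter plot described above, again with invented prices since the notebook's predictions are not reproduced here:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt

# Hypothetical actual vs. predicted selling prices.
y_actual = [3.35, 4.75, 7.25, 2.85, 9.50]
y_predicted = [3.10, 5.00, 7.00, 3.00, 8.10]

fig, ax = plt.subplots()
ax.scatter(y_actual, y_predicted)
# Diagonal y = x: points on this line are perfect predictions.
lims = [min(y_actual), max(y_actual)]
ax.plot(lims, lims, linestyle="--")
ax.set_xlabel("Actual Selling Price")
ax.set_ylabel("Predicted Selling Price")
fig.savefig("actual_vs_predicted.png")
```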

Model Comparison

Based on the evaluation metrics obtained (primarily the R² scores on the training data, as test scores were not explicitly printed in the provided notebook output cells), we can compare the models:
  • Performance: The Linear Regression model achieved a higher R² score on the training data (0.880) compared to the Lasso Regression model (0.843). This suggests that, on the data it was trained on, the standard Linear Regression model captured slightly more variance.
  • Regularization: Lasso Regression includes L1 regularization, which can help prevent overfitting and perform automatic feature selection by shrinking some feature coefficients to zero. While Linear Regression might achieve a higher training score, it can be more prone to overfitting if features are highly correlated or irrelevant. The slightly lower training score for Lasso might indicate a more generalized model, although the test R² scores are needed for a definitive comparison of generalization performance.
  • Complexity: Both models are relatively simple linear models. Lasso adds the complexity of a regularization hyperparameter (alpha), which was used with its default value in the notebook.

Best Performing Model: Without explicit R² scores on the test set, it is difficult to definitively declare the better model for generalization. Linear Regression showed the stronger fit on the training data, so if the test scores were comparable it would be slightly preferred. If Linear Regression's test performance dropped significantly relative to Lasso's, however, Lasso would be the more robust choice thanks to its regularization. Given the high training R² for Linear Regression, it appears to perform very well on this data, with Lasso providing a good alternative that may offer better stability if overfitting were a concern.

Conclusion

This project successfully demonstrated the application of machine learning for predicting used car selling prices. Data preprocessing steps, including encoding categorical variables, were crucial for preparing the data. Both Linear Regression and Lasso Regression models were trained and evaluated, showing strong performance on the training data with R² scores above 0.84. The Linear Regression model achieved a slightly higher R² score on the training set. Visualizations confirmed the models' predictive capabilities, particularly for lower-priced cars.

While both models performed well, having the test set evaluation metrics would provide a clearer picture of their generalization ability. Future improvements could involve exploring feature engineering (e.g., creating a 'Car_Age' feature from 'Year'), tuning the Lasso regularization parameter (alpha), trying other regression algorithms (like Ridge, Random Forest Regressor, or Gradient Boosting), and performing more in-depth exploratory data analysis with visualizations to uncover deeper insights into feature relationships.
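Two of the suggested improvements can be sketched briefly. Both the reference year for `Car_Age` and the alpha grid below are illustrative assumptions, and the data is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Feature engineering: derive Car_Age from Year
# (2020 as the reference year is an assumption).
df = pd.DataFrame({"Year": [2014, 2013, 2017]})
df["Car_Age"] = 2020 - df["Year"]
print(df["Car_Age"].tolist())  # [6, 7, 3]

# Tuning Lasso's alpha with cross-validated grid search on synthetic data.
rng = np.random.default_rng(1)
X = rng.random((60, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 2.0]) + rng.normal(scale=0.1, size=60)

search = GridSearchCV(Lasso(max_iter=10000), {"alpha": [0.001, 0.01, 0.1, 1.0]}, cv=5)
search.fit(X, y)
print(search.best_params_)  # the alpha with the best cross-validated score
```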

Tools & Resources

  1. VS Code
  2. Jupyter Notebook
  3. Python 3.12+

Here's the repository containing the code and notebook: https://github.com/Ahemtan/car-price-prediction


Thank You