Advanced Model Evaluation & Performance Optimization for Automobile Pricing¶

By Mohammad Sayem Chowdhury
Last Updated: June 13, 2025

Project Overview¶

This notebook presents a comprehensive framework for evaluating, validating, and optimizing machine learning models in the context of automobile price prediction. Through rigorous statistical testing, advanced cross-validation techniques, and systematic hyperparameter optimization, I ensure model reliability and production readiness.

Evaluation Objectives¶

  • Performance Assessment: Comprehensive evaluation using multiple metrics and validation strategies
  • Generalization Analysis: Detecting and preventing overfitting through robust validation frameworks
  • Model Comparison: Systematic comparison of different modeling approaches and architectures
  • Production Optimization: Fine-tuning models for optimal real-world performance

Advanced Methodologies¶

My evaluation framework incorporates industry-standard practices including k-fold cross-validation, hyperparameter grid search, regularization techniques, and ensemble methods to deliver models that perform reliably in production environments.


Author: Mohammad Sayem Chowdhury
Project Type: Machine Learning Model Optimization
Focus: Production-Ready Model Validation
Techniques: Cross-Validation, Hyperparameter Tuning, Regularization

Table of Contents¶

  1. Model Performance Evaluation Framework

    • Comprehensive metrics dashboard
    • Statistical significance testing
    • Performance benchmarking procedures
  2. Overfitting & Underfitting Analysis

    • Bias-variance tradeoff examination
    • Learning curve analysis
    • Model complexity optimization
  3. Advanced Regularization Techniques

    • Ridge regression implementation
    • Lasso regression for feature selection
    • Elastic Net hybrid approaches
  4. Hyperparameter Optimization

    • Grid search methodologies
    • Random search strategies
    • Bayesian optimization approaches
  5. Cross-Validation & Model Selection

    • K-fold cross-validation implementation
    • Stratified sampling techniques
    • Model comparison frameworks
  6. Production Readiness Assessment

    • Final model selection criteria
    • Performance monitoring setup
    • Deployment recommendations

Executive Summary¶

Model Validation Philosophy¶

Effective model evaluation transcends simple accuracy metrics, requiring comprehensive assessment of generalization capability, robustness, and real-world applicability. This systematic approach ensures that our automobile pricing models perform reliably across diverse market conditions and vehicle types.

Advanced Evaluation Strategy¶

My evaluation methodology emphasizes:

  • Multi-Metric Assessment: Utilizing R², RMSE, MAE, and domain-specific metrics
  • Robust Validation: Advanced cross-validation with statistical significance testing
  • Regularization Mastery: Implementing Ridge, Lasso, and Elastic Net for optimal generalization
  • Hyperparameter Excellence: Systematic optimization using grid search and advanced techniques

Production Impact: Rigorous evaluation ensures model reliability, minimizes prediction errors, and delivers consistent performance in real-world automotive pricing applications.
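As a concrete illustration of the multi-metric assessment listed above, the short sketch below computes R², RMSE, and MAE with scikit-learn. The y_true and y_pred arrays are hypothetical placeholders for actual and predicted prices; any fitted model's predictions could be substituted.

In [ ]:
# Minimal multi-metric evaluation sketch (y_true / y_pred are hypothetical example values)
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_true = np.array([13495.0, 16500.0, 13950.0, 17450.0])
y_pred = np.array([14100.0, 15800.0, 14500.0, 16900.0])

r2 = r2_score(y_true, y_pred)                        # proportion of variance explained
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # error in dollars, penalizes large misses
mae = mean_absolute_error(y_true, y_pred)            # average absolute error in dollars

print(f"R^2: {r2:.3f}, RMSE: {rmse:.0f}, MAE: {mae:.0f}")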

In [ ]:
# If you need to install any libraries, use pip or conda as appropriate for your environment.

If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:

In [ ]:
# Uncomment and use pip or conda to install specific versions if needed.
In [87]:
import pandas as pd
import numpy as np

This function can be used to download the dataset if needed. For my local analysis, I keep the data file in my working directory.

In [88]:
#This function will download the dataset into your browser 

# from pyodide.http import pyfetch

# async def download(url, filename):
#     response = await pyfetch(url)
#     if response.status == 200:
#         with open(filename, "wb") as f:
#             f.write(await response.bytes())

The dataset for this project is stored locally. If you need the data, you can find similar car price datasets from public sources or repositories.

In [89]:
path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/module_5_auto.csv'

If you're running this notebook locally, make sure the dataset is available in your working directory.

In [90]:
# If you need to download the dataset (for example, in a browser-based environment), uncomment the following lines:
# await download(path, "auto.csv")
# path="auto.csv"
In [91]:
df = pd.read_csv(path)
In [92]:
df.to_csv('module_5_auto.csv')

First, I'll focus on the numeric columns in the dataset for model training and evaluation.

In [93]:
df=df._get_numeric_data()
df.head()
Out[93]:
Unnamed: 0 Unnamed: 0.1 symboling normalized-losses wheel-base length width height curb-weight engine-size ... stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price city-L/100km diesel gas
0 0 0 3 122 88.6 0.811148 0.890278 48.8 2548 130 ... 2.68 9.0 111.0 5000.0 21 27 13495.0 11.190476 0 1
1 1 1 3 122 88.6 0.811148 0.890278 48.8 2548 130 ... 2.68 9.0 111.0 5000.0 21 27 16500.0 11.190476 0 1
2 2 2 1 122 94.5 0.822681 0.909722 52.4 2823 152 ... 3.47 9.0 154.0 5000.0 19 26 16500.0 12.368421 0 1
3 3 3 2 164 99.8 0.848630 0.919444 54.3 2337 109 ... 3.40 10.0 102.0 5500.0 24 30 13950.0 9.791667 0 1
4 4 4 2 164 99.4 0.848630 0.922222 54.3 2824 136 ... 3.40 8.0 115.0 5500.0 18 22 17450.0 13.055556 0 1

5 rows × 21 columns

I'll use interactive widgets and plotting libraries to visualize model results and comparisons.

In [94]:
from ipywidgets import interact, interactive, fixed, interact_manual

Custom Plotting Functions¶

Here are some helper functions I use to visualize model predictions and distributions.

In [95]:
def DistributionPlot(RedFunction, BlueFunction, RedName, BlueName, Title):
    # Plots two kernel density estimates (e.g., actual vs. predicted prices) on the same axes.
    # Note: sns.distplot is deprecated in newer seaborn releases; sns.kdeplot is the modern equivalent.
    width = 12
    height = 10
    plt.figure(figsize=(width, height))

    ax1 = sns.distplot(RedFunction, hist=False, color="r", label=RedName)
    ax2 = sns.distplot(BlueFunction, hist=False, color="b", label=BlueName, ax=ax1)

    plt.title(Title)
    plt.xlabel('Price (in dollars)')
    plt.ylabel('Proportion of Cars')

    plt.show()
    plt.close()
In [96]:
def PollyPlot(xtrain, xtest, y_train, y_test, lr, poly_transform):
    # xtrain, xtest: training and testing data for a single feature
    # y_train, y_test: corresponding target values
    # lr: fitted linear regression object
    # poly_transform: polynomial transformation object
    width = 12
    height = 10
    plt.figure(figsize=(width, height))

    xmax = max([xtrain.values.max(), xtest.values.max()])
    xmin = min([xtrain.values.min(), xtest.values.min()])

    x = np.arange(xmin, xmax, 0.1)

    plt.plot(xtrain, y_train, 'ro', label='Training Data')
    plt.plot(xtest, y_test, 'go', label='Test Data')
    plt.plot(x, lr.predict(poly_transform.fit_transform(x.reshape(-1, 1))), label='Predicted Function')
    plt.ylim([-10000, 60000])
    plt.ylabel('Price')
    plt.legend()

Part 1: Training and Testing¶

A key step in my modeling workflow is splitting the data into training and testing sets. I'll start by separating the target variable (price) from the features.

In [97]:
y_data = df['price']

Next, I'll remove the price column from the feature set to prepare the data for modeling.

In [98]:
x_data=df.drop('price',axis=1)

Now, I'll randomly split the data into training and testing sets using scikit-learn's train_test_split function.

In [99]:
from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.10, random_state=1)


print("number of test samples :", x_test.shape[0])
print("number of training samples:",x_train.shape[0])
number of test samples : 21
number of training samples: 180

The test_size parameter sets the proportion of data that is split into the testing set. In the above, the testing set is 10% of the total dataset.

Question #1):

Use the function "train_test_split" to split up the dataset such that 40% of the data samples will be utilized for testing. Set the parameter "random_state" equal to zero. The output of the function should be the following: "x_train1" , "x_test1", "y_train1" and "y_test1".

In [100]:
# Write your code below and press Shift+Enter to execute 
x_train1, x_test1, y_train1, y_test1 = train_test_split(x_data, y_data, test_size=0.40, random_state=0)


print("number of test samples :", x_test1.shape[0])
print("number of training samples:",x_train1.shape[0])
number of test samples : 81
number of training samples: 120

Let's import LinearRegression from the module linear_model.

In [101]:
from sklearn.linear_model import LinearRegression

We create a Linear Regression object:

In [102]:
lre=LinearRegression()

We fit the model using the feature "horsepower":

In [103]:
lre.fit(x_train[['horsepower']], y_train)
Out[103]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Let's calculate the R^2 on the test data:

In [104]:
lre.score(x_test[['horsepower']], y_test)
Out[104]:
0.3635875575078824

We can see the R^2 is much smaller using the test data compared to the training data.

In [105]:
lre.score(x_train[['horsepower']], y_train)
Out[105]:
0.6619724197515103

Question #2):

Find the R^2 on the test data using 40% of the dataset for testing.
In [106]:
# Write your code below and press Shift+Enter to execute 
lre.fit(x_train1[['horsepower']], y_train1)
lre.score(x_test1[['horsepower']], y_test1)
Out[106]:
0.7139364665406973

Sometimes you do not have sufficient testing data; as a result, you may want to perform cross-validation. Let's go over several methods that you can use for cross-validation.

Cross-Validation Score¶

Let's import cross_val_score from the module model_selection.

In [107]:
from sklearn.model_selection import cross_val_score

We input the object, the feature ("horsepower"), and the target data (y_data). The parameter 'cv' determines the number of folds. In this case, it is 4.

In [108]:
Rcross = cross_val_score(lre, x_data[['horsepower']], y_data, cv=4)

The default scoring is R^2. Each element in the array is the R^2 value for one fold:

In [109]:
Rcross
Out[109]:
array([0.7746232 , 0.51716687, 0.74785353, 0.04839605])

We can calculate the average and standard deviation of our estimate:

In [110]:
print("The mean of the folds are", Rcross.mean(), "and the standard deviation is" , Rcross.std())
The mean of the folds are 0.5220099150421194 and the standard deviation is 0.2911839444756025

We can use mean squared error as a score by setting the 'scoring' parameter to 'neg_mean_squared_error'. Scikit-learn returns the negated value, so we multiply by -1 to recover the MSE for each fold:

In [111]:
-1 * cross_val_score(lre,x_data[['horsepower']], y_data,cv=4,scoring='neg_mean_squared_error')
Out[111]:
array([20254142.84026704, 43745493.26505169, 12539630.34014931,
       17561927.72247589])
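Because these values are squared errors (in dollars squared), taking the square root of each fold gives an RMSE in dollars, which is easier to interpret. A minimal sketch reusing the same estimator and data:

In [ ]:
# Convert the negated MSE scores into a per-fold RMSE in dollars (sketch; reuses lre, x_data, y_data)
mse_folds = -1 * cross_val_score(lre, x_data[['horsepower']], y_data, cv=4,
                                 scoring='neg_mean_squared_error')
rmse_folds = np.sqrt(mse_folds)
print("RMSE per fold:", rmse_folds)
print("Mean RMSE    :", rmse_folds.mean())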

Question #3):

Calculate the average R^2 using two cross-validation folds with the "horsepower" feature:
In [112]:
# Write your code below and press Shift+Enter to execute 
Rcross1 = cross_val_score(lre, x_data[['horsepower']], y_data, cv=2)
Rcross1.mean()
Out[112]:
0.5166761697127429

You can also use the function 'cross_val_predict' to predict the output. The function splits the data into the specified number of folds, using one fold for testing and the remaining folds for training. First, import the function:

In [113]:
from sklearn.model_selection import cross_val_predict

We input the object, the feature "horsepower", and the target data y_data. The parameter 'cv' determines the number of folds. In this case, it is 4. We can produce an output:

In [114]:
yhat = cross_val_predict(lre,x_data[['horsepower']], y_data,cv=4)
yhat[0:5]
Out[114]:
array([14141.63807508, 14141.63807508, 20814.29423473, 12745.03562306,
       14762.35027598])

Part 2: Overfitting, Underfitting and Model Selection¶

It turns out that the test data, sometimes referred to as the "out of sample data", is a much better measure of how well your model performs in the real world. One reason for this is overfitting.

Let's go over some examples. It turns out these differences are more apparent in Multiple Linear Regression and Polynomial Regression so we will explore overfitting in that context.

Let's create Multiple Linear Regression objects and train the model using 'horsepower', 'curb-weight', 'engine-size' and 'highway-mpg' as features.

In [115]:
lr = LinearRegression()
lr.fit(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_train)
Out[115]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Prediction using training data:

In [116]:
yhat_train = lr.predict(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])
yhat_train[0:5]
Out[116]:
array([ 7426.6731551 , 28323.75090803, 14213.38819709,  4052.34146983,
       34500.19124244])

Prediction using test data:

In [117]:
yhat_test = lr.predict(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])
yhat_test[0:5]
Out[117]:
array([11349.35089149,  5884.11059106, 11208.6928275 ,  6641.07786278,
       15565.79920282])

Let's perform some model evaluation using our training and testing data separately. First, we import the matplotlib and seaborn libraries for plotting.

In [118]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

Let's examine the distribution of the predicted values of the training data.

In [119]:
Title = 'Distribution  Plot of  Predicted Value Using Training Data vs Training Data Distribution'
DistributionPlot(y_train, yhat_train, "Actual Values (Train)", "Predicted Values (Train)", Title)

Figure 1: Plot of predicted values using the training data compared to the actual values of the training data.

So far, the model seems to be doing well in learning from the training dataset. But what happens when the model encounters new data from the testing dataset? When the model generates new values from the test data, we see the distribution of the predicted values is much different from the actual target values.

In [120]:
Title='Distribution  Plot of  Predicted Value Using Test Data vs Data Distribution of Test Data'
DistributionPlot(y_test,yhat_test,"Actual Values (Test)","Predicted Values (Test)",Title)

Figure 2: Plot of predicted value using the test data compared to the actual values of the test data.

Comparing Figure 1 and Figure 2, the predicted distribution tracks the actual values far more closely for the training data (Figure 1) than for the test data (Figure 2). The mismatch in Figure 2 is most apparent in the 5,000 to 15,000 price range, where the shapes of the two distributions differ the most. We can confirm the size of this gap numerically, as shown in the sketch below, before checking whether polynomial regression exhibits a similar drop in accuracy on the test data.
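A minimal sketch of that check, reusing the lr model fitted above and the current 90/10 train/test split:

In [ ]:
# Quantifying the train/test R^2 gap for the multiple linear regression model (sketch)
features = ['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']
print("R^2 on training data:", lr.score(x_train[features], y_train))
print("R^2 on test data    :", lr.score(x_test[features], y_test))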

In [121]:
from sklearn.preprocessing import PolynomialFeatures

Overfitting¶

Overfitting occurs when the model fits the noise, but not the underlying process. Therefore, when testing your model using the test set, your model does not perform as well since it is modelling noise, not the underlying process that generated the relationship. Let's create a degree 5 polynomial model.

Let's use 55 percent of the data for training and the rest for testing:

In [122]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.45, random_state=0)

We will perform a degree 5 polynomial transformation on the feature 'horsepower'.

In [123]:
pr = PolynomialFeatures(degree=5)
x_train_pr = pr.fit_transform(x_train[['horsepower']])
x_test_pr = pr.fit_transform(x_test[['horsepower']])
pr
Out[123]:
PolynomialFeatures(degree=5, include_bias=True, interaction_only=False)

Now, let's create a Linear Regression model "poly" and train it.

In [124]:
poly = LinearRegression()
poly.fit(x_train_pr, y_train)
Out[124]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

We can see the output of our model using the method "predict." We assign the values to "yhat".

In [125]:
yhat = poly.predict(x_test_pr)
yhat[0:5]
Out[125]:
array([ 6728.73877623,  7308.06173582, 12213.81078747, 18893.1290908 ,
       19995.81407813])

Let's take the first four predicted values and compare them to the actual targets.

In [126]:
print("Predicted values:", yhat[0:4])
print("True values:", y_test[0:4].values)
Predicted values: [ 6728.73877623  7308.06173582 12213.81078747 18893.1290908 ]
True values: [ 6295. 10698. 13860. 13499.]

We will use the function "PollyPlot" that we defined at the beginning of the lab to display the training data, testing data, and the predicted function.

In [127]:
PollyPlot(x_train[['horsepower']], x_test[['horsepower']], y_train, y_test, poly,pr)

Figure 3: A polynomial regression model where red dots represent training data, green dots represent test data, and the blue line represents the model prediction.

We see that the estimated function appears to track the data but around 200 horsepower, the function begins to diverge from the data points.

R^2 of the training data:

In [128]:
poly.score(x_train_pr, y_train)
Out[128]:
0.5567716902028981

R^2 of the test data:

In [129]:
poly.score(x_test_pr, y_test)
Out[129]:
-29.87162132967278

We see the R^2 for the training data is about 0.557, while the R^2 on the test data is about -29.87. The lower the R^2, the worse the model; a negative R^2 means the model performs worse than simply predicting the mean price, which is a clear sign of overfitting. The sketch below shows where the negative value comes from.
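Recall that R² = 1 − SS_res/SS_tot. Whenever the model's residual sum of squares exceeds that of a mean-only baseline, the ratio exceeds 1 and R² turns negative. A minimal sketch computing it by hand for the degree-5 model (reusing poly, x_test_pr, and y_test from above):

In [ ]:
# Computing R^2 by hand for the degree-5 polynomial model (sketch)
yhat_deg5 = poly.predict(x_test_pr)
ss_res = np.sum((y_test - yhat_deg5) ** 2)       # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)   # total sum of squares (mean-only baseline)
print("Manual R^2:", 1 - ss_res / ss_tot)        # should match poly.score(x_test_pr, y_test)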

Let's see how the R^2 changes on the test data for different order polynomials and then plot the results:

In [130]:
Rsqu_test = []

order = [1, 2, 3, 4]
for n in order:
    pr = PolynomialFeatures(degree=n)
    
    x_train_pr = pr.fit_transform(x_train[['horsepower']])
    
    x_test_pr = pr.fit_transform(x_test[['horsepower']])    
    
    lr.fit(x_train_pr, y_train)
    
    Rsqu_test.append(lr.score(x_test_pr, y_test))

plt.plot(order, Rsqu_test)
plt.xlabel('order')
plt.ylabel('R^2')
plt.title('R^2 Using Test Data')
plt.text(3, 0.75, 'Maximum R^2 ')    
Out[130]:
Text(3, 0.75, 'Maximum R^2 ')

We see the R^2 gradually increases until an order three polynomial is used. Then, the R^2 dramatically decreases at an order four polynomial.
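Rather than reading the best order off the plot, it can be selected programmatically from the scores just computed (a small sketch using the order and Rsqu_test lists from above):

In [ ]:
# Selecting the polynomial order with the highest test R^2 (sketch)
best_idx = int(np.argmax(Rsqu_test))
print("Best order:", order[best_idx], "with test R^2 =", Rsqu_test[best_idx])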

The following function will be used in the next section. Please run the cell below.

In [131]:
def f(order, test_data):
    x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=test_data, random_state=0)
    pr = PolynomialFeatures(degree=order)
    x_train_pr = pr.fit_transform(x_train[['horsepower']])
    x_test_pr = pr.fit_transform(x_test[['horsepower']])
    poly = LinearRegression()
    poly.fit(x_train_pr,y_train)
    PollyPlot(x_train[['horsepower']], x_test[['horsepower']], y_train,y_test, poly, pr)

The following interface allows you to experiment with different polynomial orders and different amounts of data.

In [132]:
interact(f, order=(0, 6, 1), test_data=(0.05, 0.95, 0.05))
interactive(children=(IntSlider(value=3, description='order', max=6), FloatSlider(value=0.45, description='tes…
Out[132]:
<function __main__.f(order, test_data)>

4.1 Advanced Polynomial Feature Engineering¶

Building on the univariate polynomial analysis, I want to explore multi-feature polynomial transformations. This approach can capture interaction effects between different vehicle characteristics. Let me create a second-degree polynomial feature transformer to examine these relationships.

In [ ]:
# Creating a polynomial feature transformer for multi-feature analysis
pr1 = PolynomialFeatures(degree=2)
pr1
Out[ ]:
PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)

Analysis Insight: The polynomial feature transformation significantly expands our feature space, creating interaction terms and higher-order features that can capture more complex relationships in the automobile pricing data.

4.2 Multi-Feature Polynomial Transformation¶

Now I'll apply the polynomial transformation to multiple key features simultaneously. This creates interaction terms between horsepower, curb-weight, engine-size, and highway-mpg, allowing the model to capture how these features work together to influence pricing.

In [ ]:
# Applying polynomial transformation to multiple features for enhanced modeling
x_train_pr1=pr1.fit_transform(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])

x_test_pr1=pr1.fit_transform(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])

Feature Expansion Analysis: The polynomial transformation creates interaction terms between variables, significantly expanding the feature space to capture more complex relationships in the data.

4.3 Feature Dimensionality Analysis¶

I'm curious to examine how the polynomial transformation affects the dimensionality of our feature space. Understanding this helps gauge the complexity of our enhanced model.

In [ ]:
# Examining the dimensionality of our transformed feature space
x_train_pr1.shape
Out[ ]:
(110, 15)

Dimensionality Analysis: The polynomial transformation has expanded our feature space from 4 original features to 15 features, including interaction terms and quadratic features.
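The count of 15 follows from the combinatorics of a degree-2 expansion of 4 inputs: 1 bias column, 4 linear terms, 4 squared terms, and 6 pairwise interactions, i.e. C(4+2, 2) = 15. A small verification sketch (the use of scipy's comb here is an assumption; any binomial-coefficient helper would do):

In [ ]:
# Verifying the expected number of degree-2 polynomial features for 4 inputs (sketch)
from scipy.special import comb

n_features, degree = 4, 2
expected = int(comb(n_features + degree, degree))   # (n + d choose d), including the bias column
print("Expected feature count:", expected)
print("Actual feature count  :", x_train_pr1.shape[1])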

4.4 Polynomial Model Training¶

With our expanded feature set ready, I'll create and train a linear regression model using these polynomial features. This approach should capture more complex, non-linear relationships in the data.

In [ ]:
# Training a linear regression model on polynomial features
poly1 = LinearRegression()
poly1.fit(x_train_pr1, y_train)
Out[ ]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Model Training Complete: The polynomial regression model has been successfully trained on the expanded feature set, incorporating interaction terms and higher-order features.

4.5 Polynomial Model Evaluation¶

Now I'll evaluate the polynomial model's performance by generating predictions on the test set and visualizing how well the predicted values align with actual prices using a distribution comparison.

In [ ]:
# Generating predictions and visualizing model performance
yhat_test1=poly1.predict(x_test_pr1)

Title='Distribution  Plot of  Predicted Value Using Test Data vs Data Distribution of Test Data'

DistributionPlot(y_test, yhat_test1, "Actual Values (Test)", "Predicted Values (Test)", Title)

Performance Analysis: The distribution plot reveals the model's strengths and limitations, helping identify price ranges where predictions are most and least accurate.

4.6 Model Accuracy Analysis¶

From examining the distribution plot, I can identify specific price ranges where the polynomial model's predictions differ most from actual values. This analysis helps understand the model's limitations and guides potential improvements.

In [ ]:
# Analysis of prediction accuracy across different price ranges
# The predicted model shows higher values in the $10,000 price range and lower values in the $30,000 to $40,000 price range
# This suggests the model may have difficulty with extreme values at both ends of the price spectrum
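That qualitative observation can be quantified by binning the test prices and computing the mean prediction error within each bin. A minimal sketch, reusing yhat_test1 and y_test from above; the bin edges are illustrative:

In [ ]:
# Mean prediction error by price range for the multi-feature polynomial model (sketch)
bins = [0, 10000, 20000, 30000, 50000]               # illustrative bin edges in dollars
labels = ['<10k', '10-20k', '20-30k', '30-50k']
errors = pd.DataFrame({'actual': y_test.values, 'error': yhat_test1 - y_test.values})
errors['price range'] = pd.cut(errors['actual'], bins=bins, labels=labels)
print(errors.groupby('price range')['error'].mean())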

Part 3: Ridge Regression¶

In this section, we will review Ridge Regression and see how the parameter alpha changes the model. Just a note, here our test data will be used as validation data.

Let's perform a degree two polynomial transformation on our data.

In [139]:
pr=PolynomialFeatures(degree=2)
x_train_pr=pr.fit_transform(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg','normalized-losses','symboling']])
x_test_pr=pr.fit_transform(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg','normalized-losses','symboling']])

Let's import Ridge from the module linear_model.

In [140]:
from sklearn.linear_model import Ridge

Let's create a Ridge regression object, setting the regularization parameter alpha to 1:

In [141]:
RigeModel=Ridge(alpha=1)

Like regular regression, you can fit the model using the method fit.

In [142]:
RigeModel.fit(x_train_pr, y_train)
Out[142]:
Ridge(alpha=1, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

Similarly, you can obtain a prediction:

In [143]:
yhat = RigeModel.predict(x_test_pr)

Let's compare the first four predicted samples to our test set:

In [144]:
print('predicted:', yhat[0:4])
print('test set :', y_test[0:4].values)
predicted: [ 6570.82441941  9636.24891471 20949.92322737 19403.60313255]
test set : [ 6295. 10698. 13860. 13499.]

We select the value of alpha that minimizes the test error. To do so, we can use a for loop. We have also created a progress bar to see how many iterations we have completed so far.

In [145]:
from tqdm import tqdm

Rsqu_test = []
Rsqu_train = []
dummy1 = []
Alpha = 10 * np.array(range(0,1000))
pbar = tqdm(Alpha)

for alpha in pbar:
    RigeModel = Ridge(alpha=alpha) 
    RigeModel.fit(x_train_pr, y_train)
    test_score, train_score = RigeModel.score(x_test_pr, y_test), RigeModel.score(x_train_pr, y_train)
    
    pbar.set_postfix({"Test Score": test_score, "Train Score": train_score})

    Rsqu_test.append(test_score)
    Rsqu_train.append(train_score)
100%|██████████| 1000/1000 [00:35<00:00, 28.56it/s, Test Score=0.564, Train Score=0.859]

We can plot out the value of R^2 for different alphas:

In [146]:
width = 12
height = 10
plt.figure(figsize=(width, height))

plt.plot(Alpha,Rsqu_test, label='validation data  ')
plt.plot(Alpha,Rsqu_train, 'r', label='training Data ')
plt.xlabel('alpha')
plt.ylabel('R^2')
plt.legend()
Out[146]:
<matplotlib.legend.Legend at 0x7f742a133090>

Figure 4: The blue line represents the R^2 of the validation data, and the red line represents the R^2 of the training data. The x-axis represents the different values of Alpha.

The red line in Figure 4 represents the R^2 of the training data: as alpha increases, the R^2 decreases because stronger regularization constrains the coefficients and the model fits the training data less closely.

The blue line represents the R^2 on the validation data: as alpha increases, the R^2 first improves and then levels off, indicating that moderate regularization reduces overfitting before further increases stop helping.
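The alpha that maximizes the validation R^2 can be read directly from the arrays built in the loop above (a small sketch using Alpha and Rsqu_test):

In [ ]:
# Picking the alpha with the highest validation R^2 from the loop above (sketch)
best = int(np.argmax(Rsqu_test))
print("Best alpha:", Alpha[best], "with validation R^2 =", Rsqu_test[best])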

5.1 Ridge Regression Implementation¶

Now I want to implement Ridge regression with a specific regularization parameter to balance model complexity and performance. Setting alpha to 10 provides moderate regularization to prevent overfitting while maintaining good predictive capability.

In [ ]:
# Implementing Ridge regression with regularization parameter alpha=10
RigeModel = Ridge(alpha=10) 
RigeModel.fit(x_train_pr, y_train)
RigeModel.score(x_test_pr, y_test)
Out[ ]:
0.5418576440207269

Ridge Regression Analysis: With alpha=10, the regularized model achieves a test R^2 of about 0.54 on this feature set, controlling model complexity while retaining reasonable predictive performance.


Part 4: Grid Search¶

The term alpha is a hyperparameter: it is set before training rather than learned from the data. Scikit-learn provides the class GridSearchCV to make the process of finding the best hyperparameter value simpler.

Let's import GridSearchCV from the module model_selection.

In [148]:
from sklearn.model_selection import GridSearchCV

We create a dictionary of parameter values:

In [149]:
parameters1= [{'alpha': [0.001,0.1,1, 10, 100, 1000, 10000, 100000, 100000]}]
parameters1
Out[149]:
[{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000, 100000]}]

Create a Ridge regression object:

In [150]:
RR=Ridge()
RR
Out[150]:
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

Create a ridge grid search object:

In [151]:
Grid1 = GridSearchCV(RR, parameters1,cv=4)

Note: older versions of scikit-learn emit a deprecation warning about the iid parameter when fitting GridSearchCV; it does not affect the results and can be ignored.

Fit the model:

In [152]:
Grid1.fit(x_data[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_data)
Out[152]:
GridSearchCV(cv=4, error_score='raise-deprecating',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000, 100000]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

The object finds the best parameter values on the validation data. We can obtain the estimator with the best parameters and assign it to the variable BestRR as follows:

In [153]:
BestRR=Grid1.best_estimator_
BestRR
Out[153]:
Ridge(alpha=10000, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

We now test our model on the test data:

In [154]:
BestRR.score(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_test)
Out[154]:
0.8411649831036149
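Beyond this single test score, the grid search object also stores the cross-validated score for every alpha that was tried, which is useful for checking how sensitive performance is to the choice of alpha. A short sketch using attributes GridSearchCV provides:

In [ ]:
# Inspecting the cross-validated results stored by GridSearchCV (sketch)
print("Best alpha found     :", Grid1.best_params_)
print("Best mean CV R^2     :", Grid1.best_score_)
print("Mean CV R^2 per alpha:", Grid1.cv_results_['mean_test_score'])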

Closing Remarks¶

This notebook and all analysis were created by Mohammad Sayem Chowdhury as a personal data science showcase.

Thank you for exploring my approach to model evaluation and refinement! If you have any feedback or suggestions, feel free to reach out.
