Automobile Price Prediction: Comprehensive Exploratory Data Analysis¶

By Mohammad Sayem Chowdhury
Last Updated: June 13, 2025

Project Overview¶

This comprehensive notebook presents my systematic approach to exploratory data analysis (EDA) for automobile price prediction. Through detailed statistical analysis, advanced visualizations, and feature engineering, I identify the key factors that drive vehicle pricing in the automotive market.

Research Questions¶

  • Primary: What features have the biggest impact on automobile pricing?
  • Secondary: How do different vehicle characteristics correlate with market value?
  • Applied: Which variables should be prioritized in predictive modeling?

Methodology¶

My analysis employs multiple statistical techniques including correlation analysis, ANOVA, descriptive statistics, and advanced visualization methods to uncover meaningful patterns in automotive pricing data.


Author: Mohammad Sayem Chowdhury
Project Type: Exploratory Data Analysis
Domain: Automotive Price Analytics
Dataset: Automobile Features & Pricing Data

Table of Contents¶

  1. Environment Setup & Data Import

    • Library imports and configuration
    • Dataset acquisition and loading
    • Initial data inspection
  2. Data Quality Assessment

    • Missing values analysis
    • Data type validation
    • Outlier detection
  3. Descriptive Statistical Analysis

    • Central tendency measures
    • Distribution analysis
    • Summary statistics
  4. Feature Pattern Visualization

    • Continuous variable relationships
    • Categorical variable distributions
    • Price correlation patterns
  5. Advanced Statistical Analysis

    • Correlation matrix analysis
    • Pearson correlation significance testing
    • ANOVA (Analysis of Variance)
  6. Data Grouping & Aggregation

    • Multi-dimensional grouping
    • Pivot table analysis
    • Heatmap visualizations
  7. Key Findings & Recommendations

    • Feature importance ranking
    • Modeling recommendations
    • Business insights

Executive Summary¶

Key Research Question¶

What automobile features have the strongest predictive power for vehicle pricing?

This analysis will systematically evaluate each vehicle characteristic to determine its relationship with market price, providing data-driven insights for both pricing strategies and predictive model development.

1. Environment Setup & Data Import¶

1.1 Library Configuration and Imports¶

Setting up the analytical environment with essential libraries for data manipulation, statistical analysis, and visualization.

Professional Setup¶

This notebook represents a comprehensive data science analysis conducted as part of my professional portfolio. All methodologies, interpretations, and code implementations reflect industry best practices and my personal expertise in automotive data analytics.

Analysis Framework: Statistical EDA with focus on predictive feature identification
Tools: Python ecosystem (Pandas, NumPy, SciPy, Seaborn, Matplotlib)
Approach: Hypothesis-driven analysis with statistical validation

Required Libraries¶

The analysis utilizes the following Python libraries:

  • pandas: Data manipulation and analysis framework
  • numpy: Numerical computing and array operations
  • matplotlib: Core plotting and visualization
  • seaborn: Statistical data visualization and aesthetics
  • scipy: Scientific computing and statistical tests

Installation Instructions¶

# For local environment
pip install pandas numpy matplotlib seaborn scipy

# For conda environment  
conda install pandas numpy matplotlib seaborn scipy
In [ ]:
# I will use Python libraries such as pandas, matplotlib, seaborn, and scipy for data analysis and visualization.
# If you need to install any libraries, use pip or conda as appropriate for your environment.

Import libraries:

If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:

In [ ]:
# Uncomment and use pip or conda to install specific versions if needed.
In [1]:
import pandas as pd
import numpy as np

This function can be used to download the dataset if needed. For my local analysis, I keep the data file in my working directory.

In [ ]:
# This function will download the dataset into your browser 

# from pyodide.http import pyfetch

# async def download(url, filename):
#     response = await pyfetch(url)
#     if response.status == 200:
#         with open(filename, "wb") as f:
#             f.write(await response.bytes())
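The commented pyfetch helper above only works in browser-based (Pyodide) environments. For a plain local Python install, an equivalent helper can be sketched with the standard library (a sketch, not part of the original analysis; the URL and filename are whatever you pass in):

```python
import urllib.request

def download(url, filename):
    """Save the resource at `url` to a local file (local-environment sketch)."""
    urllib.request.urlretrieve(url, filename)

# Example usage: download(path, "auto.csv")
```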

Let's load the data and take a first look at the DataFrame:

The dataset for this project is stored locally. If you need the data, you can find similar car price datasets from public sources or repositories.

In [2]:
path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/automobileEDA.csv'

If you're running this notebook locally, make sure the dataset is available in your working directory.

If you are running in a browser (Pyodide) environment, you will need to download the dataset first by uncommenting the cell below and setting path="auto.csv"; if you are running locally with the file already present, leave it commented out.

In [65]:
# await download(path, "auto.csv")
# filename="auto.csv"
In [3]:
df = pd.read_csv(path)
df.head()
Out[3]:
symboling normalized-losses make aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... compression-ratio horsepower peak-rpm city-mpg highway-mpg price city-L/100km horsepower-binned diesel gas
0 3 122 alfa-romero std two convertible rwd front 88.6 0.811148 ... 9.0 111.0 5000.0 21 27 13495.0 11.190476 Medium 0 1
1 3 122 alfa-romero std two convertible rwd front 88.6 0.811148 ... 9.0 111.0 5000.0 21 27 16500.0 11.190476 Medium 0 1
2 1 122 alfa-romero std two hatchback rwd front 94.5 0.822681 ... 9.0 154.0 5000.0 19 26 16500.0 12.368421 Medium 0 1
3 2 164 audi std four sedan fwd front 99.8 0.848630 ... 10.0 102.0 5500.0 24 30 13950.0 9.791667 Medium 0 1
4 2 164 audi std four sedan 4wd front 99.4 0.848630 ... 8.0 115.0 5500.0 18 22 17450.0 13.055556 Medium 0 1

5 rows × 29 columns

2. Visualizing Feature Patterns¶

Import the visualization packages Matplotlib and Seaborn (if Seaborn is not installed, install it with pip, the Python package manager). Don't forget "%matplotlib inline" so plots render inside the Jupyter notebook.

In [4]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

How to choose the right visualization method?

When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable.
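One quick way to split columns by type before choosing plots is pandas' `select_dtypes`. A self-contained sketch (using a toy frame that stands in for the automobile data; on the real data, apply the same calls to `df`):

```python
import pandas as pd

# Toy frame mimicking the automobile dataset's mix of dtypes
toy = pd.DataFrame({
    "horsepower": [111.0, 154.0],                # float64 -> regplot/scatter
    "city-mpg": [21, 19],                        # int64   -> regplot/scatter
    "body-style": ["convertible", "hatchback"],  # object  -> boxplot/value_counts
})

numeric_cols = toy.select_dtypes(include=["number"]).columns.tolist()
categorical_cols = toy.select_dtypes(include=["object"]).columns.tolist()
print(numeric_cols)      # ['horsepower', 'city-mpg']
print(categorical_cols)  # ['body-style']
```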

In [5]:
# list the data types for each column
print(df.dtypes)
symboling              int64
normalized-losses      int64
make                  object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower           float64
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
city-L/100km         float64
horsepower-binned     object
diesel                 int64
gas                    int64
dtype: object

2.1 Understanding My Data Structure¶

Now that I've loaded my automobile dataset, I'm curious about the structure and types of variables I'm working with. Understanding data types is crucial for determining the best analytical approaches.

Let me first examine what types of data I have in each column:

In [ ]:
# Understanding my dataset structure - let me check all data types
print("Data types in my automobile dataset:")
print(df.dtypes)
print(f"\nDataset shape: {df.shape}")

# I'm particularly interested in the peak-rpm column
print(f"\nPeak RPM data type: {df['peak-rpm'].dtypes}")
print(f"Peak RPM sample values: {df['peak-rpm'].head()}")

I can see that 'peak-rpm' is stored as float64, which makes sense since RPM values can have decimal precision. Most of my numerical features are properly typed as either float64 or int64, which is well suited to correlation analysis.

Now I want to explore how these variables relate to each other, especially to car prices.

For example, we can calculate the correlation between variables of type "int64" or "float64" using the method "corr":

In [70]:
df.corr(numeric_only=True)  # numeric_only is required in pandas >= 2.0 because object columns are present
Out[70]:
symboling normalized-losses wheel-base length width height curb-weight engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price city-L/100km diesel gas
symboling 1.000000 0.466264 -0.535987 -0.365404 -0.242423 -0.550160 -0.233118 -0.110581 -0.140019 -0.008245 -0.182196 0.075819 0.279740 -0.035527 0.036233 -0.082391 0.066171 -0.196735 0.196735
normalized-losses 0.466264 1.000000 -0.056661 0.019424 0.086802 -0.373737 0.099404 0.112360 -0.029862 0.055563 -0.114713 0.217299 0.239543 -0.225016 -0.181877 0.133999 0.238567 -0.101546 0.101546
wheel-base -0.535987 -0.056661 1.000000 0.876024 0.814507 0.590742 0.782097 0.572027 0.493244 0.158502 0.250313 0.371147 -0.360305 -0.470606 -0.543304 0.584642 0.476153 0.307237 -0.307237
length -0.365404 0.019424 0.876024 1.000000 0.857170 0.492063 0.880665 0.685025 0.608971 0.124139 0.159733 0.579821 -0.285970 -0.665192 -0.698142 0.690628 0.657373 0.211187 -0.211187
width -0.242423 0.086802 0.814507 0.857170 1.000000 0.306002 0.866201 0.729436 0.544885 0.188829 0.189867 0.615077 -0.245800 -0.633531 -0.680635 0.751265 0.673363 0.244356 -0.244356
height -0.550160 -0.373737 0.590742 0.492063 0.306002 1.000000 0.307581 0.074694 0.180449 -0.062704 0.259737 -0.087027 -0.309974 -0.049800 -0.104812 0.135486 0.003811 0.281578 -0.281578
curb-weight -0.233118 0.099404 0.782097 0.880665 0.866201 0.307581 1.000000 0.849072 0.644060 0.167562 0.156433 0.757976 -0.279361 -0.749543 -0.794889 0.834415 0.785353 0.221046 -0.221046
engine-size -0.110581 0.112360 0.572027 0.685025 0.729436 0.074694 0.849072 1.000000 0.572609 0.209523 0.028889 0.822676 -0.256733 -0.650546 -0.679571 0.872335 0.745059 0.070779 -0.070779
bore -0.140019 -0.029862 0.493244 0.608971 0.544885 0.180449 0.644060 0.572609 1.000000 -0.055390 0.001263 0.566936 -0.267392 -0.582027 -0.591309 0.543155 0.554610 0.054458 -0.054458
stroke -0.008245 0.055563 0.158502 0.124139 0.188829 -0.062704 0.167562 0.209523 -0.055390 1.000000 0.187923 0.098462 -0.065713 -0.034696 -0.035201 0.082310 0.037300 0.241303 -0.241303
compression-ratio -0.182196 -0.114713 0.250313 0.159733 0.189867 0.259737 0.156433 0.028889 0.001263 0.187923 1.000000 -0.214514 -0.435780 0.331425 0.268465 0.071107 -0.299372 0.985231 -0.985231
horsepower 0.075819 0.217299 0.371147 0.579821 0.615077 -0.087027 0.757976 0.822676 0.566936 0.098462 -0.214514 1.000000 0.107885 -0.822214 -0.804575 0.809575 0.889488 -0.169053 0.169053
peak-rpm 0.279740 0.239543 -0.360305 -0.285970 -0.245800 -0.309974 -0.279361 -0.256733 -0.267392 -0.065713 -0.435780 0.107885 1.000000 -0.115413 -0.058598 -0.101616 0.115830 -0.475812 0.475812
city-mpg -0.035527 -0.225016 -0.470606 -0.665192 -0.633531 -0.049800 -0.749543 -0.650546 -0.582027 -0.034696 0.331425 -0.822214 -0.115413 1.000000 0.972044 -0.686571 -0.949713 0.265676 -0.265676
highway-mpg 0.036233 -0.181877 -0.543304 -0.698142 -0.680635 -0.104812 -0.794889 -0.679571 -0.591309 -0.035201 0.268465 -0.804575 -0.058598 0.972044 1.000000 -0.704692 -0.930028 0.198690 -0.198690
price -0.082391 0.133999 0.584642 0.690628 0.751265 0.135486 0.834415 0.872335 0.543155 0.082310 0.071107 0.809575 -0.101616 -0.686571 -0.704692 1.000000 0.789898 0.110326 -0.110326
city-L/100km 0.066171 0.238567 0.476153 0.657373 0.673363 0.003811 0.785353 0.745059 0.554610 0.037300 -0.299372 0.889488 0.115830 -0.949713 -0.930028 0.789898 1.000000 -0.241282 0.241282
diesel -0.196735 -0.101546 0.307237 0.211187 0.244356 0.281578 0.221046 0.070779 0.054458 0.241303 0.985231 -0.169053 -0.475812 0.265676 0.198690 0.110326 -0.241282 1.000000 -1.000000
gas 0.196735 0.101546 -0.307237 -0.211187 -0.244356 -0.281578 -0.221046 -0.070779 -0.054458 -0.241303 -0.985231 0.169053 0.475812 -0.265676 -0.198690 -0.110326 0.241282 -1.000000 1.000000

The diagonal elements are always one; we will study correlation, and more precisely Pearson correlation, in depth at the end of the notebook.
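Rather than scanning the full matrix, we can rank features by the absolute strength of their correlation with price. A minimal sketch (shown on a toy frame standing in for the real data; on the full dataset, replace `toy` with `df`):

```python
import pandas as pd

toy = pd.DataFrame({
    "engine-size": [61, 120, 326, 141, 98],
    "highway-mpg": [54, 30, 16, 25, 34],
    "price": [5118.0, 10295.0, 45400.0, 16500.0, 7775.0],
})

# Correlations with price, strongest in magnitude first
ranked = (toy.corr(numeric_only=True)["price"]
             .drop("price")
             .sort_values(key=lambda s: s.abs(), ascending=False))
print(ranked)
```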

2.2 Engine Characteristics Correlation Analysis¶

I'm particularly interested in understanding how different engine characteristics relate to each other. Let me examine the correlations between bore, stroke, compression-ratio, and horsepower to identify any patterns that might influence vehicle pricing or performance.

In [ ]:
# Analyzing correlations between key engine characteristics
df[['bore','stroke','compression-ratio','horsepower']].corr()
Out[ ]:
bore stroke compression-ratio horsepower
bore 1.000000 -0.055390 0.001263 0.566936
stroke -0.055390 1.000000 0.187923 0.098462
compression-ratio 0.001263 0.187923 1.000000 -0.214514
horsepower 0.566936 0.098462 -0.214514 1.000000

Analysis Insight: The correlation matrix reveals interesting patterns among engine characteristics. Bore shows a moderate positive correlation with horsepower (0.567), suggesting that larger cylinder diameters tend to accompany higher power output. Compression ratio shows minimal correlation with the other three variables, indicating it varies somewhat independently in engine design. These relationships provide useful context for understanding vehicle performance characteristics.

Continuous Numerical Variables:

Continuous numerical variables are variables that may contain any value within some range. They can be of type "int64" or "float64". A great way to visualize these variables is by using scatterplots with fitted lines.

In order to start understanding the (linear) relationship between an individual variable and the price, we can use "regplot" which plots the scatterplot plus the fitted regression line for the data.

Let's see several examples of different linear relationships:

Positive Linear Relationship

Let's find the scatterplot of "engine-size" and "price".

In [72]:
# Engine size as potential predictor variable of price
sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)
Out[72]:
(0.0, 53201.80823184669)
[Regression plot: engine-size vs. price, with fitted line]

As the engine-size goes up, the price goes up: this indicates a positive linear relationship between these two variables. Engine size seems like a good predictor of price, since the data points fall close to the fitted regression line.

We can examine the correlation between 'engine-size' and 'price' and see that it's approximately 0.87.

In [73]:
df[["engine-size", "price"]].corr()
Out[73]:
engine-size price
engine-size 1.000000 0.872335
price 0.872335 1.000000

Negative Linear Relationship

Highway mpg is a potential predictor variable of price. Let's find the scatterplot of "highway-mpg" and "price".

In [74]:
sns.regplot(x="highway-mpg", y="price", data=df)
Out[74]:
<AxesSubplot:xlabel='highway-mpg', ylabel='price'>
[Regression plot: highway-mpg vs. price, with fitted line]

As highway-mpg goes up, the price goes down: this indicates an inverse/negative relationship between these two variables. Highway mpg could potentially be a predictor of price.

We can examine the correlation between 'highway-mpg' and 'price' and see it's approximately -0.704.

In [75]:
df[['highway-mpg', 'price']].corr()
Out[75]:
highway-mpg price
highway-mpg 1.000000 -0.704692
price -0.704692 1.000000

Weak Linear Relationship

Let's see if "peak-rpm" is a predictor variable of "price".

In [76]:
sns.regplot(x="peak-rpm", y="price", data=df)
Out[76]:
<AxesSubplot:xlabel='peak-rpm', ylabel='price'>
[Regression plot: peak-rpm vs. price, with fitted line]

Peak rpm does not seem like a good predictor of the price at all since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of variability. Therefore, it's not a reliable variable.

We can examine the correlation between 'peak-rpm' and 'price' and see it's approximately -0.101616.

In [77]:
df[['peak-rpm','price']].corr()
Out[77]:
peak-rpm price
peak-rpm 1.000000 -0.101616
price -0.101616 1.000000

2.3 Stroke and Price Correlation Analysis¶

Before visualizing the relationship between stroke and price, let me quantify their correlation to understand the strength of this relationship statistically.

In [ ]:
# Calculating the correlation between stroke and price
df[["stroke","price"]].corr()
Out[ ]:
stroke price
stroke 1.00000 0.08231
price 0.08231 1.00000

Key Finding: The correlation between stroke and price is very weak (0.0823), confirming that engine stroke has minimal linear relationship with vehicle pricing. This suggests that stroke alone is not a significant factor in determining market value, and I should focus on other engine characteristics for price prediction.

2.4 Visual Correlation Analysis: Stroke vs. Price¶

Given the weak correlation between stroke and price that we observed earlier, I'm curious to see what this relationship looks like visually. Let me create a regression plot to better understand if there's any discernible linear pattern between engine stroke and vehicle pricing.

In [ ]:
# Creating a regression plot to visualize the stroke-price relationship
sns.regplot(x="stroke", y="price", data=df)
Out[ ]:
<AxesSubplot:xlabel='stroke', ylabel='price'>
[Regression plot: stroke vs. price, with fitted line]

Analysis Insight: The regression plot confirms my expectations based on the correlation analysis. There's indeed a very weak relationship between stroke and price, with significant scatter around the regression line. This suggests that engine stroke alone is not a reliable predictor of vehicle price, and regression modeling using this variable would likely yield poor results.

Categorical Variables

These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type "object" or "int64". A good way to visualize categorical variables is by using boxplots.

Let's look at the relationship between "body-style" and "price".

In [80]:
sns.boxplot(x="body-style", y="price", data=df)
Out[80]:
<AxesSubplot:xlabel='body-style', ylabel='price'>
[Boxplot: price by body-style]

We see that the distributions of price between the different body-style categories overlap significantly, so body-style would not be a good predictor of price. Let's examine "engine-location" and "price":

In [81]:
sns.boxplot(x="engine-location", y="price", data=df)
Out[81]:
<AxesSubplot:xlabel='engine-location', ylabel='price'>
[Boxplot: price by engine-location]

Here we see that the distribution of price between these two engine-location categories, front and rear, are distinct enough to take engine-location as a potential good predictor of price.

Let's examine "drive-wheels" and "price".

In [82]:
# drive-wheels
sns.boxplot(x="drive-wheels", y="price", data=df)
Out[82]:
<AxesSubplot:xlabel='drive-wheels', ylabel='price'>
[Boxplot: price by drive-wheels]

Here we see that the distribution of price between the different drive-wheels categories differs. As such, drive-wheels could potentially be a predictor of price.

3. Descriptive Statistical Analysis¶

Let's first take a look at the variables by utilizing a description method.

The describe function automatically computes basic statistics for all continuous variables. Any NaN values are automatically skipped in these statistics.

This will show:

  • the count of that variable
  • the mean
  • the standard deviation (std)
  • the minimum value
  • the quartiles (25%, 50% and 75%), from which the IQR (Interquartile Range) can be derived
  • the maximum value

We can apply the method "describe" as follows:

In [83]:
df.describe()
Out[83]:
symboling normalized-losses wheel-base length width height curb-weight engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price city-L/100km diesel gas
count 201.000000 201.00000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 197.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000
mean 0.840796 122.00000 98.797015 0.837102 0.915126 53.766667 2555.666667 126.875622 3.330692 3.256904 10.164279 103.405534 5117.665368 25.179104 30.686567 13207.129353 9.944145 0.099502 0.900498
std 1.254802 31.99625 6.066366 0.059213 0.029187 2.447822 517.296727 41.546834 0.268072 0.319256 4.004965 37.365700 478.113805 6.423220 6.815150 7947.066342 2.534599 0.300083 0.300083
min -2.000000 65.00000 86.600000 0.678039 0.837500 47.800000 1488.000000 61.000000 2.540000 2.070000 7.000000 48.000000 4150.000000 13.000000 16.000000 5118.000000 4.795918 0.000000 0.000000
25% 0.000000 101.00000 94.500000 0.801538 0.890278 52.000000 2169.000000 98.000000 3.150000 3.110000 8.600000 70.000000 4800.000000 19.000000 25.000000 7775.000000 7.833333 0.000000 1.000000
50% 1.000000 122.00000 97.000000 0.832292 0.909722 54.100000 2414.000000 120.000000 3.310000 3.290000 9.000000 95.000000 5125.369458 24.000000 30.000000 10295.000000 9.791667 0.000000 1.000000
75% 2.000000 137.00000 102.400000 0.881788 0.925000 55.500000 2926.000000 141.000000 3.580000 3.410000 9.400000 116.000000 5500.000000 30.000000 34.000000 16500.000000 12.368421 0.000000 1.000000
max 3.000000 256.00000 120.900000 1.000000 1.000000 59.800000 4066.000000 326.000000 3.940000 4.170000 23.000000 262.000000 6600.000000 49.000000 54.000000 45400.000000 18.076923 1.000000 1.000000

The default setting of "describe" skips variables of type object. We can apply the method "describe" on the variables of type 'object' as follows:

In [84]:
df.describe(include=['object'])
Out[84]:
make aspiration num-of-doors body-style drive-wheels engine-location engine-type num-of-cylinders fuel-system horsepower-binned
count 201 201 201 201 201 201 201 201 201 200
unique 22 2 2 5 3 2 6 7 8 3
top toyota std four sedan fwd front ohc four mpfi Low
freq 32 165 115 94 118 198 145 157 92 115

Value Counts

Value counts is a good way of understanding how many units of each characteristic/variable we have. We can apply the "value_counts" method on the column "drive-wheels". Don’t forget the method "value_counts" only works on pandas series, not pandas dataframes. As a result, we only include one bracket df['drive-wheels'], not two brackets df[['drive-wheels']].

In [85]:
df['drive-wheels'].value_counts()
Out[85]:
fwd    118
rwd     75
4wd      8
Name: drive-wheels, dtype: int64

We can convert the series to a dataframe as follows:

In [86]:
df['drive-wheels'].value_counts().to_frame()
Out[86]:
drive-wheels
fwd 118
rwd 75
4wd 8

Let's repeat the above steps but save the results to the dataframe "drive_wheels_counts" and rename the column 'drive-wheels' to 'value_counts'.

In [87]:
drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()
drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)
drive_wheels_counts
Out[87]:
value_counts
fwd 118
rwd 75
4wd 8

Now let's rename the index to 'drive-wheels':

In [88]:
drive_wheels_counts.index.name = 'drive-wheels'
drive_wheels_counts
Out[88]:
value_counts
drive-wheels
fwd 118
rwd 75
4wd 8

We can repeat the above process for the variable 'engine-location'.

In [89]:
# engine-location as variable
engine_loc_counts = df['engine-location'].value_counts().to_frame()
engine_loc_counts.rename(columns={'engine-location': 'value_counts'}, inplace=True)
engine_loc_counts.index.name = 'engine-location'
engine_loc_counts.head(10)
Out[89]:
value_counts
engine-location
front 198
rear 3

After examining the value counts of the engine location, we see that engine location would not be a good predictor variable for the price. This is because we only have three cars with a rear engine and 198 with an engine in the front, so this result is skewed. Thus, we are not able to draw any conclusions about the engine location.
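This kind of imbalance is easy to quantify with normalized value counts. A minimal sketch (the toy series below stands in for `df['engine-location']`, reproducing the counts above):

```python
import pandas as pd

# Toy series mirroring the 198 front / 3 rear split seen in the dataset
engine_location = pd.Series(["front"] * 198 + ["rear"] * 3)

# normalize=True converts counts to proportions
shares = engine_location.value_counts(normalize=True)
print(shares)  # front ~0.985, rear ~0.015
```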

4. Grouping and Aggregation¶

The "groupby" method groups data by different categories. The data is grouped based on one or several variables, and analysis is performed on the individual groups.

For example, let's group by the variable "drive-wheels". We see that there are 3 different categories of drive wheels.

In [90]:
df['drive-wheels'].unique()
Out[90]:
array(['rwd', 'fwd', '4wd'], dtype=object)

If we want to know, on average, which type of drive wheel is most valuable, we can group "drive-wheels" and then average them.

We can select the columns 'drive-wheels', 'body-style' and 'price', then assign it to the variable "df_group_one".

In [12]:
df_group_one = df[['drive-wheels','body-style','price']]

We can then calculate the average price for each of the different categories of data.

In [92]:
# grouping results: average price per drive-wheels category
# numeric_only=True is needed in pandas >= 2.0 because 'body-style' is an object column
df_group_one = df_group_one.groupby(['drive-wheels'], as_index=False).mean(numeric_only=True)
df_group_one
Out[92]:
drive-wheels price
0 4wd 10241.000000
1 fwd 9244.779661
2 rwd 19757.613333

From our data, it seems rear-wheel drive vehicles are, on average, the most expensive, while 4-wheel and front-wheel are approximately the same in price.

You can also group by multiple variables. For example, let's group by both 'drive-wheels' and 'body-style'. This groups the dataframe by the unique combination of 'drive-wheels' and 'body-style'. We can store the results in the variable 'grouped_test1'.

In [93]:
# grouping results
df_gptest = df[['drive-wheels','body-style','price']]
grouped_test1 = df_gptest.groupby(['drive-wheels','body-style'],as_index=False).mean()
grouped_test1
Out[93]:
drive-wheels body-style price
0 4wd hatchback 7603.000000
1 4wd sedan 12647.333333
2 4wd wagon 9095.750000
3 fwd convertible 11595.000000
4 fwd hardtop 8249.000000
5 fwd hatchback 8396.387755
6 fwd sedan 9811.800000
7 fwd wagon 9997.333333
8 rwd convertible 23949.600000
9 rwd hardtop 24202.714286
10 rwd hatchback 14337.777778
11 rwd sedan 21711.833333
12 rwd wagon 16994.222222

This grouped data is much easier to visualize when it is made into a pivot table. A pivot table is like an Excel spreadsheet, with one variable along the columns and another along the rows. We can convert the grouped dataframe to a pivot table using the "pivot" method.

In this case, we will leave the drive-wheels variable as the rows of the table, and pivot body-style to become the columns of the table:

In [94]:
grouped_pivot = grouped_test1.pivot(index='drive-wheels',columns='body-style')
grouped_pivot
Out[94]:
price
body-style convertible hardtop hatchback sedan wagon
drive-wheels
4wd NaN NaN 7603.000000 12647.333333 9095.750000
fwd 11595.0 8249.000000 8396.387755 9811.800000 9997.333333
rwd 23949.6 24202.714286 14337.777778 21711.833333 16994.222222

Often, we won't have data for some of the pivot cells. We can fill these missing cells with the value 0, but any other value could potentially be used as well. It should be mentioned that missing data is quite a complex subject and is an entire course on its own.

In [95]:
grouped_pivot = grouped_pivot.fillna(0) #fill missing values with 0
grouped_pivot
Out[95]:
price
body-style convertible hardtop hatchback sedan wagon
drive-wheels
4wd 0.0 0.000000 7603.000000 12647.333333 9095.750000
fwd 11595.0 8249.000000 8396.387755 9811.800000 9997.333333
rwd 23949.6 24202.714286 14337.777778 21711.833333 16994.222222

4.1 Analyzing Price Patterns by Body Style¶

To better understand how different vehicle body styles affect pricing, I want to examine the average price for each body style category. This analysis will help me identify which body styles command premium pricing in the market.

In [ ]:
# Let me calculate the average price for each body style to identify pricing patterns
# Note: select from df directly, since df_group_one was overwritten above and no longer contains 'body-style'
df_group_two = df[['body-style', 'price']].groupby(['body-style'], as_index=False).mean()
df_group_two
Out[ ]:
body-style price
0 convertible 21890.500000
1 hardtop 22208.500000
2 hatchback 9957.441176
3 sedan 14459.755319
4 wagon 12371.960000

If you have not already imported "pyplot", import it now:

In [14]:
import matplotlib.pyplot as plt
%matplotlib inline 

Variables: Drive Wheels and Body Style vs. Price

Let's use a heat map to visualize the relationship between Body Style vs Price.

In [98]:
#use the grouped results
plt.pcolor(grouped_pivot, cmap='RdBu')
plt.colorbar()
plt.show()
[Heatmap: price by drive-wheels (rows) and body-style (columns), default axis labels]

The heatmap plots the target variable (price) proportional to colour, with 'drive-wheels' and 'body-style' on the vertical and horizontal axes, respectively. This allows us to visualize how the price is related to 'drive-wheels' and 'body-style'.

The default labels convey no useful information to us. Let's change that:

In [99]:
fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap='RdBu')

#label names
row_labels = grouped_pivot.columns.levels[1]
col_labels = grouped_pivot.index

#move ticks and labels to the center
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)

#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

#rotate label if too long
plt.xticks(rotation=90)

fig.colorbar(im)
plt.show()
[Heatmap with labeled, centered axis ticks: price by drive-wheels and body-style]

Visualization is very important in data science, and Python visualization packages provide great freedom. We will go more in-depth in a separate Python visualizations course.

The main question we want to answer in this module is, "What are the main characteristics which have the most impact on the car price?".

To get a better measure of the important characteristics, we look at the correlation of these variables with the car price. In other words: how is the car price dependent on this variable?

5. Correlation and Causation¶

Correlation: a measure of the extent of interdependence between variables.

Causation: the relationship between cause and effect between two variables.

It is important to know the difference between these two. Correlation does not imply causation. Determining correlation is much simpler than determining causation, as establishing causation may require independent experimentation.
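A small simulation makes the distinction concrete: if a hidden common factor drives two variables, they correlate strongly even though neither causes the other. A sketch with synthetic data (the variable names are illustrative, not from the dataset):

```python
import numpy as np

rng = np.random.default_rng(42)
z = rng.normal(size=500)             # hidden common cause (e.g., overall vehicle size)
x = 2.0 * z + rng.normal(size=500)   # driven by z, not by y
y = 3.0 * z + rng.normal(size=500)   # driven by z, not by x

r = np.corrcoef(x, y)[0, 1]
print(round(r, 2))  # strongly correlated despite no direct causal link
```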

Pearson Correlation

The Pearson Correlation measures the linear dependence between two variables X and Y.

The resulting coefficient is a value between -1 and 1 inclusive, where:

  • 1: Perfect positive linear correlation.
  • 0: No linear correlation between the two variables (a nonlinear relationship may still exist).
  • -1: Perfect negative linear correlation.
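The "corr" method returns only the coefficient; `scipy.stats.pearsonr` also returns a p-value for significance testing. A self-contained sketch on synthetic data (the variables here are illustrative stand-ins, not the dataset's columns):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
engine_size = rng.uniform(60, 330, size=100)            # synthetic stand-in
price = 80 * engine_size + rng.normal(0, 2000, 100)     # price roughly linear in size

# pearsonr returns the correlation coefficient and the two-sided p-value
r, p_value = stats.pearsonr(engine_size, price)
print(f"r = {r:.3f}, p = {p_value:.2e}")  # large r with a tiny p-value -> significant
```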

Pearson Correlation is the default method of the function "corr". As before, we can calculate the Pearson Correlation of the 'int64' or 'float64' variables.

In [100]:
df.corr(numeric_only=True)  # numeric_only is required in pandas >= 2.0 because object columns are present
Out[100]:
symboling normalized-losses wheel-base length width height curb-weight engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price city-L/100km diesel gas
symboling 1.000000 0.466264 -0.535987 -0.365404 -0.242423 -0.550160 -0.233118 -0.110581 -0.140019 -0.008245 -0.182196 0.075819 0.279740 -0.035527 0.036233 -0.082391 0.066171 -0.196735 0.196735
normalized-losses 0.466264 1.000000 -0.056661 0.019424 0.086802 -0.373737 0.099404 0.112360 -0.029862 0.055563 -0.114713 0.217299 0.239543 -0.225016 -0.181877 0.133999 0.238567 -0.101546 0.101546
wheel-base -0.535987 -0.056661 1.000000 0.876024 0.814507 0.590742 0.782097 0.572027 0.493244 0.158502 0.250313 0.371147 -0.360305 -0.470606 -0.543304 0.584642 0.476153 0.307237 -0.307237
length -0.365404 0.019424 0.876024 1.000000 0.857170 0.492063 0.880665 0.685025 0.608971 0.124139 0.159733 0.579821 -0.285970 -0.665192 -0.698142 0.690628 0.657373 0.211187 -0.211187
width -0.242423 0.086802 0.814507 0.857170 1.000000 0.306002 0.866201 0.729436 0.544885 0.188829 0.189867 0.615077 -0.245800 -0.633531 -0.680635 0.751265 0.673363 0.244356 -0.244356
height -0.550160 -0.373737 0.590742 0.492063 0.306002 1.000000 0.307581 0.074694 0.180449 -0.062704 0.259737 -0.087027 -0.309974 -0.049800 -0.104812 0.135486 0.003811 0.281578 -0.281578
curb-weight -0.233118 0.099404 0.782097 0.880665 0.866201 0.307581 1.000000 0.849072 0.644060 0.167562 0.156433 0.757976 -0.279361 -0.749543 -0.794889 0.834415 0.785353 0.221046 -0.221046
engine-size -0.110581 0.112360 0.572027 0.685025 0.729436 0.074694 0.849072 1.000000 0.572609 0.209523 0.028889 0.822676 -0.256733 -0.650546 -0.679571 0.872335 0.745059 0.070779 -0.070779
bore -0.140019 -0.029862 0.493244 0.608971 0.544885 0.180449 0.644060 0.572609 1.000000 -0.055390 0.001263 0.566936 -0.267392 -0.582027 -0.591309 0.543155 0.554610 0.054458 -0.054458
stroke -0.008245 0.055563 0.158502 0.124139 0.188829 -0.062704 0.167562 0.209523 -0.055390 1.000000 0.187923 0.098462 -0.065713 -0.034696 -0.035201 0.082310 0.037300 0.241303 -0.241303
compression-ratio -0.182196 -0.114713 0.250313 0.159733 0.189867 0.259737 0.156433 0.028889 0.001263 0.187923 1.000000 -0.214514 -0.435780 0.331425 0.268465 0.071107 -0.299372 0.985231 -0.985231
horsepower 0.075819 0.217299 0.371147 0.579821 0.615077 -0.087027 0.757976 0.822676 0.566936 0.098462 -0.214514 1.000000 0.107885 -0.822214 -0.804575 0.809575 0.889488 -0.169053 0.169053
peak-rpm 0.279740 0.239543 -0.360305 -0.285970 -0.245800 -0.309974 -0.279361 -0.256733 -0.267392 -0.065713 -0.435780 0.107885 1.000000 -0.115413 -0.058598 -0.101616 0.115830 -0.475812 0.475812
city-mpg -0.035527 -0.225016 -0.470606 -0.665192 -0.633531 -0.049800 -0.749543 -0.650546 -0.582027 -0.034696 0.331425 -0.822214 -0.115413 1.000000 0.972044 -0.686571 -0.949713 0.265676 -0.265676
highway-mpg 0.036233 -0.181877 -0.543304 -0.698142 -0.680635 -0.104812 -0.794889 -0.679571 -0.591309 -0.035201 0.268465 -0.804575 -0.058598 0.972044 1.000000 -0.704692 -0.930028 0.198690 -0.198690
price -0.082391 0.133999 0.584642 0.690628 0.751265 0.135486 0.834415 0.872335 0.543155 0.082310 0.071107 0.809575 -0.101616 -0.686571 -0.704692 1.000000 0.789898 0.110326 -0.110326
city-L/100km 0.066171 0.238567 0.476153 0.657373 0.673363 0.003811 0.785353 0.745059 0.554610 0.037300 -0.299372 0.889488 0.115830 -0.949713 -0.930028 0.789898 1.000000 -0.241282 0.241282
diesel -0.196735 -0.101546 0.307237 0.211187 0.244356 0.281578 0.221046 0.070779 0.054458 0.241303 0.985231 -0.169053 -0.475812 0.265676 0.198690 0.110326 -0.241282 1.000000 -1.000000
gas 0.196735 0.101546 -0.307237 -0.211187 -0.244356 -0.281578 -0.221046 -0.070779 -0.054458 -0.241303 -0.985231 0.169053 0.475812 -0.265676 -0.198690 -0.110326 0.241282 -1.000000 1.000000
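Reading a matrix this large by eye is error-prone, so it can help to pull out just the column of correlations against price and rank it by strength. This is a minimal sketch, assuming `df` is the cleaned, all-numeric DataFrame behind the table above; `rank_price_correlations` is a hypothetical helper name, not part of the original analysis.

```python
import pandas as pd

def rank_price_correlations(df: pd.DataFrame) -> pd.Series:
    """Return each numeric feature's correlation with 'price',
    ordered from strongest to weakest by absolute value."""
    corr = df.corr()["price"].drop("price")
    # Sort by |r| so strong negative correlations (e.g. city-mpg)
    # rank alongside strong positive ones (e.g. engine-size).
    return corr.reindex(corr.abs().sort_values(ascending=False).index)
```

Applied to the table above, engine-size (~0.872) and curb-weight (~0.834) would land at the top of this ranking.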

Sometimes we would like to know the significance of the correlation estimate.

P-value

What is this P-value? The P-value is the probability of observing a correlation at least as strong as the one measured, under the assumption that no true correlation exists. Normally, we choose a significance level of 0.05: if the p-value falls below it, we are at least 95% confident that the correlation between the variables is significant.

By convention, when the:

  • p-value is $<$ 0.001: there is strong evidence that the correlation is significant.
  • p-value is $<$ 0.05: there is moderate evidence that the correlation is significant.
  • p-value is $<$ 0.1: there is weak evidence that the correlation is significant.
  • p-value is $>$ 0.1: there is no evidence that the correlation is significant.
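This convention can be captured in a small helper so that every conclusion below applies the same cutoffs consistently. A minimal sketch: `evidence_strength` is a hypothetical function name, and the 0.001 / 0.05 / 0.1 thresholds are exactly the ones listed above.

```python
def evidence_strength(p_value: float) -> str:
    """Map a p-value to the conventional evidence label used in this notebook."""
    if p_value < 0.001:
        return "strong"
    elif p_value < 0.05:
        return "moderate"
    elif p_value < 0.1:
        return "weak"
    return "none"
```

For example, every p-value reported in the tests below is far under 0.001, so each would be labeled "strong".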

We can obtain this information using the "stats" module in the "scipy" library.

In [15]:
from scipy import stats

Wheel-Base vs. Price

Let's calculate the Pearson Correlation Coefficient and P-value of 'wheel-base' and 'price'.

In [102]:
pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)  
The Pearson Correlation Coefficient is 0.5846418222655085  with a P-value of P = 8.076488270732243e-20

Conclusion:

Since the p-value is $<$ 0.001, the correlation between wheel-base and price is statistically significant, although the linear relationship isn't extremely strong (~0.585).

Horsepower vs. Price

Let's calculate the Pearson Correlation Coefficient and P-value of 'horsepower' and 'price'.

In [103]:
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)  
The Pearson Correlation Coefficient is 0.8095745670036559  with a P-value of P =  6.369057428260101e-48

Conclusion:

Since the p-value is $<$ 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.809, close to 1).

Length vs. Price

Let's calculate the Pearson Correlation Coefficient and P-value of 'length' and 'price'.

In [104]:
pearson_coef, p_value = stats.pearsonr(df['length'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)  
The Pearson Correlation Coefficient is 0.6906283804483643  with a P-value of P =  8.01647746615853e-30

Conclusion:

Since the p-value is $<$ 0.001, the correlation between length and price is statistically significant, and the linear relationship is moderately strong (~0.691).

Width vs. Price

Let's calculate the Pearson Correlation Coefficient and P-value of 'width' and 'price':

In [105]:
pearson_coef, p_value = stats.pearsonr(df['width'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value ) 
The Pearson Correlation Coefficient is 0.7512653440522666  with a P-value of P = 9.200335510483739e-38

Conclusion:¶

Since the p-value is < 0.001, the correlation between width and price is statistically significant, and the linear relationship is quite strong (~0.751).

Curb-Weight vs. Price¶

Let's calculate the Pearson Correlation Coefficient and P-value of 'curb-weight' and 'price':

In [106]:
pearson_coef, p_value = stats.pearsonr(df['curb-weight'], df['price'])
print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)  
The Pearson Correlation Coefficient is 0.8344145257702845  with a P-value of P =  2.189577238893816e-53

Conclusion:

Since the p-value is $<$ 0.001, the correlation between curb-weight and price is statistically significant, and the linear relationship is quite strong (~0.834).

Engine-Size vs. Price

Let's calculate the Pearson Correlation Coefficient and P-value of 'engine-size' and 'price':

In [107]:
pearson_coef, p_value = stats.pearsonr(df['engine-size'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value) 
The Pearson Correlation Coefficient is 0.8723351674455188  with a P-value of P = 9.265491622196808e-64

Conclusion:

Since the p-value is $<$ 0.001, the correlation between engine-size and price is statistically significant, and the linear relationship is very strong (~0.872).

Bore vs. Price

Let's calculate the Pearson Correlation Coefficient and P-value of 'bore' and 'price':

In [108]:
pearson_coef, p_value = stats.pearsonr(df['bore'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =  ", p_value ) 
The Pearson Correlation Coefficient is 0.54315538326266  with a P-value of P =   8.049189483935489e-17

Conclusion:

Since the p-value is $<$ 0.001, the correlation between bore and price is statistically significant, but the linear relationship is only moderate (~0.543).

We can repeat the process for 'city-mpg' and 'highway-mpg':

City-mpg vs. Price

In [109]:
pearson_coef, p_value = stats.pearsonr(df['city-mpg'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)  
The Pearson Correlation Coefficient is -0.6865710067844684  with a P-value of P =  2.3211320655672453e-29

Conclusion:

Since the p-value is $<$ 0.001, the correlation between city-mpg and price is statistically significant, and the coefficient of about -0.687 shows that the relationship is negative and moderately strong.

Highway-mpg vs. Price

In [110]:
pearson_coef, p_value = stats.pearsonr(df['highway-mpg'], df['price'])
print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value ) 
The Pearson Correlation Coefficient is -0.7046922650589534  with a P-value of P =  1.749547114447437e-31

Conclusion:¶

Since the p-value is < 0.001, the correlation between highway-mpg and price is statistically significant, and the coefficient of about -0.705 shows that the relationship is negative and moderately strong.
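The eight tests above repeat the same `stats.pearsonr` call one cell at a time; the same results can be generated in a single loop. A sketch, assuming `df` contains the listed columns with no missing values; `pearson_report` is a hypothetical helper name.

```python
from scipy import stats

def pearson_report(df, features, target="price"):
    """Return {feature: (pearson_coefficient, p_value)} against the target."""
    return {f: stats.pearsonr(df[f], df[target]) for f in features}

# Example call for the continuous variables tested above:
# report = pearson_report(df, ["wheel-base", "horsepower", "length",
#                              "width", "curb-weight", "engine-size",
#                              "bore", "city-mpg", "highway-mpg"])
```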

6. ANOVA (Analysis of Variance)¶

ANOVA: Analysis of Variance

The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:

F-test score: ANOVA takes as its null hypothesis that the means of all groups are equal, calculates how much the actual means deviate from that assumption, and reports the result as the F-test score. A larger score means there is a larger difference between the means.

P-value: P-value tells how statistically significant our calculated score value is.

If our price variable is strongly correlated with the variable we are analyzing, we expect ANOVA to return a sizeable F-test score and a small p-value.

Drive Wheels

Since ANOVA analyzes the difference between the means of different groups of the same variable, the groupby function will come in handy. Because the ANOVA routine computes the group averages itself, we do not need to take the averages beforehand.

To see if different types of 'drive-wheels' impact 'price', we group the data.

In [111]:
grouped_test2=df_gptest[['drive-wheels', 'price']].groupby(['drive-wheels'])
grouped_test2.head(2)
Out[111]:
drive-wheels price
0 rwd 13495.0
1 rwd 16500.0
3 fwd 13950.0
4 4wd 17450.0
5 fwd 15250.0
136 4wd 7603.0
In [112]:
df_gptest
Out[112]:
drive-wheels body-style price
0 rwd convertible 13495.0
1 rwd convertible 16500.0
2 rwd hatchback 16500.0
3 fwd sedan 13950.0
4 4wd sedan 17450.0
... ... ... ...
196 rwd sedan 16845.0
197 rwd sedan 19045.0
198 rwd sedan 21485.0
199 rwd sedan 22470.0
200 rwd sedan 22625.0

201 rows × 3 columns

We can obtain the values of each group using the method "get_group".

In [113]:
grouped_test2.get_group('4wd')['price']
Out[113]:
4      17450.0
136     7603.0
140     9233.0
141    11259.0
144     8013.0
145    11694.0
150     7898.0
151     8778.0
Name: price, dtype: float64
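Before running the F-test, it can help to compare the group sizes and mean prices directly, since a large gap between group means foreshadows a large F-score. A sketch using the same groupby object built above; `group_summary` is a hypothetical helper name.

```python
def group_summary(grouped):
    """Count, mean, and standard deviation of price for each group."""
    return grouped["price"].agg(["count", "mean", "std"])

# e.g. group_summary(grouped_test2) yields one row per drive-wheels type
```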

We can use the function 'f_oneway' in the module 'stats' to obtain the F-test score and P-value.

In [114]:
# ANOVA
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price'])  
 
print( "ANOVA results: F=", f_val, ", P =", p_val)   
ANOVA results: F= 67.95406500780399 , P = 3.3945443577151245e-23

This is a strong result: the large F-test score shows that the group means differ substantially, and the P-value of almost 0 implies the difference is almost certainly statistically significant. But does this mean the prices of all three tested groups differ from one another?

Let's examine them separately.

fwd and rwd¶

In [115]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'])  
 
print( "ANOVA results: F=", f_val, ", P =", p_val )
ANOVA results: F= 130.5533160959111 , P = 2.2355306355677845e-23

Let's examine the other groups.

4wd and rwd¶

In [116]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('rwd')['price'])  
   
print( "ANOVA results: F=", f_val, ", P =", p_val)   
ANOVA results: F= 8.580681368924756 , P = 0.004411492211225333

4wd and fwd

In [117]:
f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('fwd')['price'])  
 
print("ANOVA results: F=", f_val, ", P =", p_val)   
ANOVA results: F= 0.665465750252303 , P = 0.41620116697845655
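The three pairwise comparisons above can also be produced in one loop over every pair of groups. A sketch assuming the `grouped_test2` object built earlier; `pairwise_anova` is a hypothetical helper name.

```python
from itertools import combinations
from scipy import stats

def pairwise_anova(grouped, groups=("fwd", "rwd", "4wd")):
    """Run a one-way ANOVA on price for every pair of drive-wheel groups."""
    results = {}
    for a, b in combinations(groups, 2):
        f_val, p_val = stats.f_oneway(
            grouped.get_group(a)["price"],
            grouped.get_group(b)["price"],
        )
        results[(a, b)] = (f_val, p_val)
    return results
```

With more than three groups, the number of pairs grows quickly, which is exactly when a loop like this pays off over hand-written cells.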

Conclusion: Important Variables

The pairwise tests show that fwd vs. rwd and 4wd vs. rwd differ significantly in price, while 4wd vs. fwd does not (F ≈ 0.665, p ≈ 0.416): the strong overall ANOVA result is driven mainly by the rwd group. Drive-wheels nevertheless remains a useful categorical predictor.

We now have a better idea of what our data looks like and which variables are important to take into account when predicting the car price. We have narrowed it down to the following variables:

Continuous numerical variables:

  • Length
  • Width
  • Curb-weight
  • Engine-size
  • Horsepower
  • City-mpg
  • Highway-mpg
  • Wheel-base
  • Bore

Categorical variables:

  • Drive-wheels

As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.

Thank you for completing this notebook!¶


This notebook and all analysis were created by Mohammad Sayem Chowdhury as a personal data science showcase.

Thank you for exploring my approach to exploratory data analysis! If you have any feedback or suggestions, feel free to reach out.