Automobile Data Wrangling & Preprocessing: Professional Pipeline¶

By Mohammad Sayem Chowdhury
Last Updated: June 13, 2025

Project Overview¶

This notebook demonstrates a comprehensive, production-ready data wrangling and preprocessing pipeline specifically designed for automobile datasets. Through systematic data cleaning, validation, and transformation processes, I prepare raw automotive data for advanced statistical analysis and machine learning applications.

Data Quality Objectives¶

  • Completeness: Handle missing values using domain-appropriate imputation strategies
  • Consistency: Standardize data formats and ensure uniform data types
  • Accuracy: Detect and correct outliers and anomalous values
  • Usability: Transform data into ML-ready format with proper feature engineering

Professional Standards¶

This preprocessing pipeline follows industry best practices for data quality assurance, featuring comprehensive documentation, validation checkpoints, and reproducible transformation procedures suitable for production environments.


Author: Mohammad Sayem Chowdhury
Project Type: Data Engineering & Preprocessing
Domain: Automotive Data Analytics
Framework: Production-Ready Data Pipeline

Table of Contents¶

  1. Environment Setup & Data Import

    • Library configuration and imports
    • Dataset loading and initial inspection
    • Data quality assessment framework
  2. Missing Value Analysis & Treatment

    • Missing value identification and patterns
    • Domain-specific imputation strategies
    • Validation of imputation results
  3. Data Type Optimization & Format Correction

    • Data type validation and conversion
    • Format standardization procedures
    • Categorical encoding preparation
  4. Data Standardization & Normalization

    • Feature scaling methodologies
    • Distribution normalization techniques
    • Cross-feature consistency checks
  5. Feature Engineering & Transformation

    • Binning and discretization strategies
    • Indicator variable creation
    • Advanced feature construction
  6. Data Quality Validation & Export

    • Final quality assurance checks
    • Performance optimization
    • Clean dataset export procedures

Executive Summary¶

The Critical Role of Data Wrangling¶

Data wrangling represents the foundation of any successful analytics project, typically consuming 60-80% of a data scientist's time. This systematic approach to data preparation ensures that downstream analysis and modeling efforts are built upon a solid, reliable foundation.

Professional Approach to Data Preprocessing¶

This notebook demonstrates enterprise-grade data wrangling techniques specifically tailored for automotive datasets. My methodology emphasizes:

  • Systematic Quality Assessment: Comprehensive evaluation of data integrity issues
  • Domain-Aware Processing: Automotive industry-specific validation rules and transformations
  • Reproducible Procedures: Fully documented and version-controlled preprocessing steps
  • Performance Optimization: Efficient processing suitable for large-scale production datasets

Business Impact: Clean, well-structured data directly translates to more accurate models, reliable insights, and trustworthy business decisions.


Import Data¶

For this project, I'll use a car dataset in CSV format. You can find similar datasets from public sources or repositories. I'll demonstrate how I load and inspect the data using pandas.

Import pandas

If you are running this notebook in a browser via JupyterLite, install the required libraries with piplite as shown below; in a local environment, use pip or conda instead.¶

In [ ]:
import piplite
await piplite.install(['pandas'])
await piplite.install(['matplotlib'])

If you run the notebook locally, you can install the specific library versions used here by uncommenting the following:

In [ ]:
#If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:
#install specific version of libraries used in lab
#! mamba install pandas==1.3.3
#! mamba install numpy=1.21.2
In [1]:
import pandas as pd
import matplotlib.pyplot as plt

The commented-out function below can download the dataset when running in the browser. For my local analysis, I keep the data file in my working directory.

In [2]:
#This function will download the dataset into your browser 
# from pyodide.http import pyfetch

# async def download(url, filename):
#     response = await pyfetch(url)
#     if response.status == 200:
#         with open(filename, "wb") as f:
#             f.write(await response.bytes())

Reading the Dataset and Adding Headers¶

First, I assign the URL or file path of the dataset to a variable. Then, I create a list of column headers based on the dataset documentation and use it when loading the data with pandas.

Here, the URL of the dataset is assigned to "filename"; a local file path works just as well if the data is stored on disk.

In [3]:
filename = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/auto.csv"

Then, we create a Python list, headers, containing the column names from the dataset documentation.

In [4]:
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

If you are running in the browser, uncomment the following lines to download the dataset; if you are running locally with auto.csv already on disk, leave them commented out.

In [5]:
# await download(filename, "auto.csv")
# filename="auto.csv"

Use the Pandas method read_csv() to load the data from the web address. Set the parameter "names" equal to the Python list "headers".

In [6]:
df = pd.read_csv(filename, names = headers)

I'll use the head() method to preview the first few rows of the DataFrame and check for any obvious issues.

In [7]:
# To see what the data set looks like, we'll use the head() method.
df.head()
Out[7]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

5 rows × 26 columns

As you can see, several entries in the DataFrame are missing and are represented by question marks ("?"). Handling missing data is a crucial part of my data cleaning process.

So, how do we identify all those missing values and deal with them?

How to work with missing data?

Steps for working with missing data:

  1. Identify missing data
  2. Deal with missing data
  3. Correct data format

Identify and Handle Missing Values¶

Step 1: Identify Missing Data¶

In this dataset, missing values are represented by the question mark ?. I replace these with NaN (Not a Number), which is pandas' default marker for missing values. This makes it easier to handle missing data in subsequent steps.

In [8]:
import numpy as np

# replace "?" to NaN
df.replace("?", np.nan, inplace = True)
df.head(5)
Out[8]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 NaN alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 NaN alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 NaN alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

5 rows × 26 columns
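
The same effect can also be achieved at load time by telling read_csv which token marks missing data; this is an equivalent alternative to the replace step above, sketched here for reference:

# equivalent one-step alternative: treat "?" as NaN while reading
# df = pd.read_csv(filename, names=headers, na_values="?")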

Evaluating for Missing Data

The missing values are now encoded as NaN, which pandas detects natively. There are two methods to identify them:

  1. .isnull()
  2. .notnull()

The output is a DataFrame of booleans indicating, for each entry, whether it is missing: .isnull() returns True where a value is missing, and .notnull() returns the opposite.
In [9]:
missing_data = df.isnull()
missing_data.head(5)
Out[9]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 False True False False False False False False False False ... False False False False False False False False False False
1 False True False False False False False False False False ... False False False False False False False False False False
2 False True False False False False False False False False ... False False False False False False False False False False
3 False False False False False False False False False False ... False False False False False False False False False False
4 False False False False False False False False False False ... False False False False False False False False False False

5 rows × 26 columns

"True" means the value is a missing value while "False" means the value is not a missing value.

Count missing values in each column

Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, "True" represents a missing value and "False" means the value is present in the dataset. In the body of the for loop, the method ".value_counts()" tallies how many entries are "True" and how many are "False".

In [10]:
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")    
symboling
False    205
Name: symboling, dtype: int64

normalized-losses
False    164
True      41
Name: normalized-losses, dtype: int64

make
False    205
Name: make, dtype: int64

fuel-type
False    205
Name: fuel-type, dtype: int64

aspiration
False    205
Name: aspiration, dtype: int64

num-of-doors
False    203
True       2
Name: num-of-doors, dtype: int64

body-style
False    205
Name: body-style, dtype: int64

drive-wheels
False    205
Name: drive-wheels, dtype: int64

engine-location
False    205
Name: engine-location, dtype: int64

wheel-base
False    205
Name: wheel-base, dtype: int64

length
False    205
Name: length, dtype: int64

width
False    205
Name: width, dtype: int64

height
False    205
Name: height, dtype: int64

curb-weight
False    205
Name: curb-weight, dtype: int64

engine-type
False    205
Name: engine-type, dtype: int64

num-of-cylinders
False    205
Name: num-of-cylinders, dtype: int64

engine-size
False    205
Name: engine-size, dtype: int64

fuel-system
False    205
Name: fuel-system, dtype: int64

bore
False    201
True       4
Name: bore, dtype: int64

stroke
False    201
True       4
Name: stroke, dtype: int64

compression-ratio
False    205
Name: compression-ratio, dtype: int64

horsepower
False    203
True       2
Name: horsepower, dtype: int64

peak-rpm
False    203
True       2
Name: peak-rpm, dtype: int64

city-mpg
False    205
Name: city-mpg, dtype: int64

highway-mpg
False    205
Name: highway-mpg, dtype: int64

price
False    201
True       4
Name: price, dtype: int64

Based on the summary above, each column has 205 rows of data, and seven of the columns contain missing values:

  1. "normalized-losses": 41 missing data
  2. "num-of-doors": 2 missing data
  3. "bore": 4 missing data
  4. "stroke" : 4 missing data
  5. "horsepower": 2 missing data
  6. "peak-rpm": 2 missing data
  7. "price": 4 missing data

Deal with missing data

How to deal with missing data?
  1. Drop data
    a. Drop the whole row
    b. Drop the whole column
  2. Replace data
    a. Replace it by mean
    b. Replace it by frequency
    c. Replace it based on other functions

Whole columns should be dropped only if most entries in the column are empty. In our dataset, none of the columns are empty enough to drop entirely. We have some freedom in choosing which method to replace data; however, some methods may seem more reasonable than others. We will apply each method to many different columns:

Replace by mean:

  • "normalized-losses": 41 missing data, replace them with mean
  • "stroke": 4 missing data, replace them with mean
  • "bore": 4 missing data, replace them with mean
  • "horsepower": 2 missing data, replace them with mean
  • "peak-rpm": 2 missing data, replace them with mean

Replace by frequency:

  • "num-of-doors": 2 missing data, replace them with "four".
    • Reason: 84% sedans is four doors. Since four doors is most frequent, it is most likely to occur

Drop the whole row:

  • "price": 4 missing data, simply delete the whole row
    • Reason: price is what we want to predict. Any data entry without price data cannot be used for prediction; therefore any row now without price data is not useful to us
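
Since the same mean-imputation pattern repeats for five columns below, it can also be wrapped in a small helper. This is only a sketch of an equivalent shortcut; the cells that follow keep each step explicit instead:

# mean-impute a list of numeric columns that are currently stored as strings
def impute_mean(frame, columns):
    for col in columns:
        mean_val = frame[col].astype("float").mean()
        frame[col] = frame[col].replace(np.nan, mean_val)
    return frame

# df = impute_mean(df, ["normalized-losses", "bore", "stroke", "horsepower", "peak-rpm"])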

Calculate the mean value for the "normalized-losses" column

In [11]:
avg_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)
print("Average of normalized-losses:", avg_norm_loss)
Average of normalized-losses: 122.0

Replace "NaN" with mean value in "normalized-losses" column

In [12]:
df["normalized-losses"].replace(np.nan, avg_norm_loss, inplace=True)

Calculate the mean value for the "bore" column

In [13]:
avg_bore=df['bore'].astype('float').mean(axis=0)
print("Average of bore:", avg_bore)
Average of bore: 3.3297512437810957

Replace "NaN" with the mean value in the "bore" column

In [14]:
df["bore"].replace(np.nan, avg_bore, inplace=True)

Question #1:

Based on the example above, replace NaN in "stroke" column with the mean value.

In [15]:
# Write your code below and press Shift+Enter to execute 
avg_stroke=df['stroke'].astype('float').mean(axis=0)
print("Average of stroke:", avg_stroke)
df["stroke"].replace(np.nan, avg_stroke, inplace=True)
Average of stroke: 3.2554228855721337

Calculate the mean value for the "horsepower" column

In [16]:
avg_horsepower = df['horsepower'].astype('float').mean(axis=0)
print("Average horsepower:", avg_horsepower)
Average horsepower: 104.25615763546799

Replace "NaN" with the mean value in the "horsepower" column

In [17]:
df['horsepower'].replace(np.nan, avg_horsepower, inplace=True)

Calculate the mean value for "peak-rpm" column

In [18]:
avg_peakrpm=df['peak-rpm'].astype('float').mean(axis=0)
print("Average peak rpm:", avg_peakrpm)
Average peak rpm: 5125.369458128079

Replace "NaN" with the mean value in the "peak-rpm" column

In [19]:
df['peak-rpm'].replace(np.nan, avg_peakrpm, inplace=True)

To see which values are present in a particular column, we can use the ".value_counts()" method:

In [20]:
df['num-of-doors'].value_counts()
Out[20]:
four    114
two      89
Name: num-of-doors, dtype: int64

Mode¶

We can see that four doors are the most common type. We can also use the ".idxmax()" method to calculate the most common type automatically:

In [21]:
df['num-of-doors'].value_counts().idxmax()
Out[21]:
'four'

The replacement procedure is very similar to what we have seen previously:

In [22]:
#replace the missing 'num-of-doors' values by the most frequent 
df["num-of-doors"].replace(np.nan, "four", inplace=True)

Finally, let's drop all rows that do not have price data:

In [23]:
# simply drop whole row with NaN in "price" column
df.dropna(subset=["price"], axis=0, inplace=True)

# reset index, because we dropped four rows
df.reset_index(drop=True, inplace=True)
In [24]:
df.head()
Out[24]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 122.0 alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 122.0 alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 122.0 alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450

5 rows × 26 columns

Good! Now, we have a dataset with no missing values.
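
To make that checkpoint explicit, here is a minimal validation sketch in the spirit of the pipeline's quality objectives (my addition):

# validation checkpoint: confirm no NaN remains in the treated columns
treated = ["normalized-losses", "num-of-doors", "bore", "stroke",
           "horsepower", "peak-rpm", "price"]
assert df[treated].isnull().sum().sum() == 0, "missing values remain after treatment"
print("Rows x columns after cleaning:", df.shape)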

Correct data format

We are almost there!

The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).

In Pandas, we use:

.dtypes to check the data type

.astype() to change the data type

Let's list the data types for each column

In [25]:
df.dtypes
Out[25]:
symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                 object
dtype: object

As we can see above, some columns are not of the correct data type. Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'. For example, 'bore' and 'stroke' variables are numerical values that describe the engines, so we should expect them to be of the type 'float' or 'int'; however, they are shown as type 'object'. We have to convert data types into a proper format for each column using the "astype()" method.

Convert data types to proper format

In [26]:
df[["bore", "stroke"]] = df[["bore", "stroke"]].astype("float")
df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")
df[["price"]] = df[["price"]].astype("float")
df[["peak-rpm"]] = df[["peak-rpm"]].astype("float")

Let us list the columns after the conversion

In [27]:
df.dtypes
Out[27]:
symboling              int64
normalized-losses      int32
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower            object
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
dtype: object

Wonderful!

Now we have finally obtained the cleaned dataset, with no missing values and all data in its proper format.

Data Standardization

Data is usually collected from different agencies in different formats. (Data standardization is also a term for a particular type of data normalization where we subtract the mean and divide by the standard deviation.)

What is standardization?

Standardization is the process of transforming data into a common format, allowing the researcher to make meaningful comparisons.

Example

Transform mpg to L/100km:

In our dataset, the fuel consumption columns "city-mpg" and "highway-mpg" are in mpg (miles per gallon). Assume we are developing an application for a country that uses the L/100km standard.

We will need to apply data transformation to transform mpg into L/100km.

The formula for unit conversion is:

L/100km = 235 / mpg

(The constant follows from 3.785 L per US gallon and 1.609 km per mile: 100 × 3.785 / 1.609 ≈ 235.)

We can do many mathematical operations directly in Pandas.

In [28]:
df.head()
Out[28]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 122 alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000.0 21 27 13495.0
1 3 122 alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000.0 21 27 16500.0
2 1 122 alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000.0 19 26 16500.0
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500.0 24 30 13950.0
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500.0 18 22 17450.0

5 rows × 26 columns

In [29]:
# Convert mpg to L/100km by mathematical operation (235 divided by mpg)
df['city-L/100km'] = 235/df["city-mpg"]

# check your transformed data 
df.head()
Out[29]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price city-L/100km
0 3 122 alfa-romero gas std two convertible rwd front 88.6 ... mpfi 3.47 2.68 9.0 111 5000.0 21 27 13495.0 11.190476
1 3 122 alfa-romero gas std two convertible rwd front 88.6 ... mpfi 3.47 2.68 9.0 111 5000.0 21 27 16500.0 11.190476
2 1 122 alfa-romero gas std two hatchback rwd front 94.5 ... mpfi 2.68 3.47 9.0 154 5000.0 19 26 16500.0 12.368421
3 2 164 audi gas std four sedan fwd front 99.8 ... mpfi 3.19 3.40 10.0 102 5500.0 24 30 13950.0 9.791667
4 2 164 audi gas std four sedan 4wd front 99.4 ... mpfi 3.19 3.40 8.0 115 5500.0 18 22 17450.0 13.055556

5 rows × 27 columns

Question #2:

According to the example above, transform mpg to L/100km in the "highway-mpg" column and rename the column to "highway-L/100km".

In [30]:
# Write your code below and press Shift+Enter to execute 
# transform mpg to L/100km by mathematical operation (235 divided by mpg)
df["highway-mpg"] = 235/df["highway-mpg"]

# rename column from "highway-mpg" to "highway-L/100km"
# note: the key must be the bare column name; wrapping it in extra quotes would make rename a silent no-op
df.rename(columns={'highway-mpg':'highway-L/100km'}, inplace=True)
# check your transformed data 
df.head()
Out[30]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base ... fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-L/100km price city-L/100km
0 3 122 alfa-romero gas std two convertible rwd front 88.6 ... mpfi 3.47 2.68 9.0 111 5000.0 21 8.703704 13495.0 11.190476
1 3 122 alfa-romero gas std two convertible rwd front 88.6 ... mpfi 3.47 2.68 9.0 111 5000.0 21 8.703704 16500.0 11.190476
2 1 122 alfa-romero gas std two hatchback rwd front 94.5 ... mpfi 2.68 3.47 9.0 154 5000.0 19 9.038462 16500.0 12.368421
3 2 164 audi gas std four sedan fwd front 99.8 ... mpfi 3.19 3.40 10.0 102 5500.0 24 7.833333 13950.0 9.791667
4 2 164 audi gas std four sedan 4wd front 99.4 ... mpfi 3.19 3.40 8.0 115 5500.0 18 10.681818 17450.0 13.055556

5 rows × 27 columns


Data Normalization

Why normalization?

Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.

Example

To demonstrate normalization, let's say we want to scale the columns "length", "width" and "height".

Target: would like to normalize those variables so their value ranges from 0 to 1

Approach: replace original value by (original value)/(maximum value)

In [31]:
# replace (original value) by (original value)/(maximum value)
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()
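
For reference, the other two normalizations mentioned above can be written just as directly. A sketch (commented out, since this dataset uses simple feature scaling):

# z-score standardization: mean 0, variance 1
# df['length'] = (df['length'] - df['length'].mean()) / df['length'].std()

# min-max scaling: values in [0, 1]
# df['length'] = (df['length'] - df['length'].min()) / (df['length'].max() - df['length'].min())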

Question #3:

According to the example above, normalize the column "height".

In [32]:
# Write your code below and press Shift+Enter to execute 
df['height'] = df['height']/df['height'].max()
# show the scaled columns
df[["length","width","height"]].head()

Here we can see that "length", "width" and "height" have been normalized to the range [0, 1].

Binning

Why binning?

Binning is a process of transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis.

Example:

In our dataset, "horsepower" is a real valued variable ranging from 48 to 288 and it has 59 unique values. What if we only care about the price difference between cars with high horsepower, medium horsepower, and little horsepower (3 types)? Can we rearrange them into three ‘bins' to simplify analysis?

We will use the pandas method 'cut' to segment the 'horsepower' column into 3 bins.

Example of Binning Data In Pandas

Convert data to correct format:

In [33]:
df["horsepower"]=df["horsepower"].astype(int, copy=True)

Let's plot the histogram of horsepower to see what the distribution of horsepower looks like.

In [34]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(df["horsepower"])

# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")
Out[34]:
Text(0.5, 1.0, 'horsepower bins')
[Figure: histogram of horsepower]

We would like 3 bins of equal width, so we use numpy's linspace(start_value, end_value, numbers_generated) function.

Since we want to include the minimum value of horsepower, we want to set start_value = min(df["horsepower"]).

Since we want to include the maximum value of horsepower, we want to set end_value = max(df["horsepower"]).

Since we are building 3 bins of equal length, there should be 4 dividers, so numbers_generated = 4.

We build a bin array spanning the minimum to the maximum value of horsepower; these cut points determine where one bin ends and the next begins.

In [35]:
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
bins
Out[35]:
array([ 48.        , 119.33333333, 190.66666667, 262.        ])

We set group names:

In [36]:
group_names = ['Low', 'Medium', 'High']

We apply the function "cut" to determine which bin each value of df['horsepower'] belongs to.

In [37]:
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True )
df[['horsepower','horsepower-binned']].head(20)
Out[37]:
horsepower horsepower-binned
0 111 Low
1 111 Low
2 154 Medium
3 102 Low
4 115 Low
5 110 Low
6 110 Low
7 110 Low
8 140 Medium
9 101 Low
10 101 Low
11 121 Medium
12 121 Medium
13 121 Medium
14 182 Medium
15 182 Medium
16 182 Medium
17 48 Low
18 70 Low
19 70 Low

Let's see the number of vehicles in each bin:

In [38]:
df["horsepower-binned"].value_counts()
Out[38]:
Low       153
Medium     43
High        5
Name: horsepower-binned, dtype: int64
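
The counts are heavily skewed toward "Low" because the bins have equal width, not equal membership. If roughly equal-sized groups were preferred, pandas' qcut could be used instead (a sketch, not part of this pipeline):

# equal-frequency (quantile) bins: each bin holds about a third of the cars
# df['horsepower-qbinned'] = pd.qcut(df['horsepower'], q=3, labels=group_names)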

Let's plot the distribution of each bin:

In [39]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.bar(group_names, df["horsepower-binned"].value_counts())

# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")
Out[39]:
Text(0.5, 1.0, 'horsepower bins')
[Figure: bar chart of horsepower bin counts]

Look at the dataframe above carefully. You will find that the last column provides the bins for "horsepower" based on 3 categories ("Low", "Medium" and "High").

We successfully narrowed down the intervals from 59 to 3!

Bins Visualization

Normally, a histogram is used to visualize the distribution of bins we created above.
In [40]:
%matplotlib inline
import matplotlib.pyplot as plt

# draw histogram of attribute "horsepower" with bins = 3
plt.hist(df["horsepower"], bins = 3)

# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")
Out[40]:
Text(0.5, 1.0, 'horsepower bins')
[Figure: histogram of horsepower with 3 bins]

The plot above shows the binning result for the attribute "horsepower".

Indicator Variable (or Dummy Variable)

What is an indicator variable?

An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning.

Why use indicator variables?

We use indicator variables so that categorical variables can be included in regression analysis later on.

Example

We see the column "fuel-type" has two unique values: "gas" or "diesel". Regression doesn't understand words, only numbers. To use this attribute in regression analysis, we convert "fuel-type" to indicator variables.

We will use pandas' method 'get_dummies' to assign numerical values to different categories of fuel type.

In [41]:
df.columns
Out[41]:
Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-L/100km', 'price', 'city-L/100km', 'horsepower-binned'],
      dtype='object')

Get the indicator variables and assign them to the data frame "dummy_variable_1":

In [42]:
dummy_variable_1 = pd.get_dummies(df["fuel-type"])
dummy_variable_1.head()
Out[42]:
diesel gas
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1

Change the column names for clarity:

In [43]:
dummy_variable_1.rename(columns={'gas':'fuel-type-gas', 'diesel':'fuel-type-diesel'}, inplace=True)
dummy_variable_1.head()
Out[43]:
fuel-type-diesel fuel-type-gas
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1

Merge Dataframe¶

The indicator columns now encode 'fuel-type' as 0s and 1s. Next, we merge them into the dataframe and drop the original 'fuel-type' column.

In [44]:
# merge data frame "df" and "dummy_variable_1" 
df = pd.concat([df, dummy_variable_1], axis=1)

# drop original column "fuel-type" from "df"
df.drop("fuel-type", axis = 1, inplace=True)
In [45]:
df.head()
Out[45]:
symboling normalized-losses make aspiration num-of-doors body-style drive-wheels engine-location wheel-base length ... compression-ratio horsepower peak-rpm city-mpg highway-L/100km price city-L/100km horsepower-binned fuel-type-diesel fuel-type-gas
0 3 122 alfa-romero std two convertible rwd front 88.6 0.811148 ... 9.0 111 5000.0 21 8.703704 13495.0 11.190476 Low 0 1
1 3 122 alfa-romero std two convertible rwd front 88.6 0.811148 ... 9.0 111 5000.0 21 8.703704 16500.0 11.190476 Low 0 1
2 1 122 alfa-romero std two hatchback rwd front 94.5 0.822681 ... 9.0 154 5000.0 19 9.038462 16500.0 12.368421 Medium 0 1
3 2 164 audi std four sedan fwd front 99.8 0.848630 ... 10.0 102 5500.0 24 7.833333 13950.0 9.791667 Low 0 1
4 2 164 audi std four sedan 4wd front 99.4 0.848630 ... 8.0 115 5500.0 18 10.681818 17450.0 13.055556 Low 0 1

5 rows × 29 columns

The last two columns are now the indicator variable representation of the fuel-type variable. They're all 0s and 1s now.

Question #4:

Similar to before, create an indicator variable for the column "aspiration"

In [47]:
# Write your code below and press Shift+Enter to execute 
dummy_variable_2 = pd.get_dummies(df["aspiration"])
# change column names for clarity
dummy_variable_2.rename(columns={'std':'aspiration-std', 'turbo': 'aspiration-turbo'}, inplace=True)
dummy_variable_2.head()
Out[47]:
aspiration-std aspiration-turbo
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0

Question #5:

Merge the new dataframe to the original dataframe, then drop the column 'aspiration'.

In [48]:
# Write your code below and press Shift+Enter to execute 
# merge data frame "df" and "dummy_variable_1" 
df = pd.concat([df, dummy_variable_2], axis=1)

# drop original column "fuel-type" from "df"
df.drop("aspiration", axis = 1, inplace=True)
df.head()
Out[48]:
symboling normalized-losses make num-of-doors body-style drive-wheels engine-location wheel-base length width ... peak-rpm city-mpg highway-L/100km price city-L/100km horsepower-binned fuel-type-diesel fuel-type-gas aspiration-std aspiration-turbo
0 3 122 alfa-romero two convertible rwd front 88.6 0.811148 0.890278 ... 5000.0 21 8.703704 13495.0 11.190476 Low 0 1 1 0
1 3 122 alfa-romero two convertible rwd front 88.6 0.811148 0.890278 ... 5000.0 21 8.703704 16500.0 11.190476 Low 0 1 1 0
2 1 122 alfa-romero two hatchback rwd front 94.5 0.822681 0.909722 ... 5000.0 19 9.038462 16500.0 12.368421 Medium 0 1 1 0
3 2 164 audi four sedan fwd front 99.8 0.848630 0.919444 ... 5500.0 24 7.833333 13950.0 9.791667 Low 0 1 1 0
4 2 164 audi four sedan 4wd front 99.4 0.848630 0.922222 ... 5500.0 18 10.681818 17450.0 13.055556 Low 0 1 1 0

5 rows × 30 columns


Save the Dataset¶

In [49]:
df.to_csv('clean_df.csv')
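
By default, to_csv also writes the DataFrame index as an unnamed first column; if that is not wanted downstream, it can be omitted (a minor variation on the cell above):

# omit the index so the CSV contains only the data columns
# df.to_csv('clean_df.csv', index=False)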



This notebook and all analysis were created by Mohammad Sayem Chowdhury as a personal data science showcase.

Thank you for exploring my approach to data wrangling and cleaning! If you have any feedback or suggestions, feel free to reach out.