Automobile Data Wrangling & Preprocessing: Professional Pipeline¶
By Mohammad Sayem Chowdhury
Last Updated: June 13, 2025
Project Overview¶
This notebook demonstrates a comprehensive, production-ready data wrangling and preprocessing pipeline specifically designed for automobile datasets. Through systematic data cleaning, validation, and transformation processes, I prepare raw automotive data for advanced statistical analysis and machine learning applications.
Data Quality Objectives¶
- Completeness: Handle missing values using domain-appropriate imputation strategies
- Consistency: Standardize data formats and ensure uniform data types
- Accuracy: Detect and correct outliers and anomalous values
- Usability: Transform data into ML-ready format with proper feature engineering
Professional Standards¶
This preprocessing pipeline follows industry best practices for data quality assurance, featuring comprehensive documentation, validation checkpoints, and reproducible transformation procedures suitable for production environments.
Author: Mohammad Sayem Chowdhury
Project Type: Data Engineering & Preprocessing
Domain: Automotive Data Analytics
Framework: Production-Ready Data Pipeline
Table of Contents¶
Environment Setup & Data Import
- Library configuration and imports
- Dataset loading and initial inspection
- Data quality assessment framework
Missing Value Analysis & Treatment
- Missing value identification and patterns
- Domain-specific imputation strategies
- Validation of imputation results
Data Type Optimization & Format Correction
- Data type validation and conversion
- Format standardization procedures
- Categorical encoding preparation
Data Standardization & Normalization
- Feature scaling methodologies
- Distribution normalization techniques
- Cross-feature consistency checks
Feature Engineering & Transformation
- Binning and discretization strategies
- Indicator variable creation
- Advanced feature construction
Data Quality Validation & Export
- Final quality assurance checks
- Performance optimization
- Clean dataset export procedures
Executive Summary¶
The Critical Role of Data Wrangling¶
Data wrangling represents the foundation of any successful analytics project, typically consuming 60-80% of a data scientist's time. This systematic approach to data preparation ensures that downstream analysis and modeling efforts are built upon a solid, reliable foundation.
Professional Approach to Data Preprocessing¶
This notebook demonstrates enterprise-grade data wrangling techniques specifically tailored for automotive datasets. My methodology emphasizes:
- Systematic Quality Assessment: Comprehensive evaluation of data integrity issues
- Domain-Aware Processing: Automotive industry-specific validation rules and transformations
- Reproducible Procedures: Fully documented and version-controlled preprocessing steps
- Performance Optimization: Efficient processing suitable for large-scale production datasets
Business Impact: Clean, well-structured data directly translates to more accurate models, reliable insights, and trustworthy business decisions.
What is the fuel consumption (L/100k) rate for the diesel car?
Import Data¶
For this project, I'll use a car dataset in CSV format. You can find similar datasets from public sources or repositories. I'll demonstrate how I load and inspect the data using pandas.
Import pandas
If you need to install any libraries, use pip or conda as appropriate for your environment.¶
import piplite
await piplite.install(['pandas'])
await piplite.install(['matplotlib'])
If you run the lab locally, you can load the correct library and versions by uncommenting the following:
Uncomment and use pip or conda to install specific versions if needed.¶
#If you run the lab locally using Anaconda, you can load the correct library and versions by uncommenting the following:
#install specific version of libraries used in lab
#! mamba install pandas==1.3.3
#! mamba install numpy=1.21.2
import pandas as pd
import matplotlib.pylab as plt
This function can be used to download the dataset if needed. For my local analysis, I keep the data file in my working directory.
#This function will download the dataset into your browser
# from pyodide.http import pyfetch
# async def download(url, filename):
# response = await pyfetch(url)
# if response.status == 200:
# with open(filename, "wb") as f:
# f.write(await response.bytes())
Reading the Dataset and Adding Headers¶
First, I assign the URL or file path of the dataset to a variable. Then, I create a list of column headers based on the dataset documentation and use it when loading the data with pandas.
First, we assign the URL of the dataset to "filename".
The dataset for this project is stored locally. If you need the data, you can find similar car datasets from public sources or repositories.
filename = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/auto.csv"
Then, we create a Python list headers containing name of headers.
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
"drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
"num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
"peak-rpm","city-mpg","highway-mpg","price"]
you will need to download the dataset; if you are running locally, please comment out the following
# await download(filename, "auto.csv")
# filename="auto.csv"
Use the Pandas method read_csv() to load the data from the web address. Set the parameter "names" equal to the Python list "headers".
df = pd.read_csv(filename, names = headers)
I'll use the head() method to preview the first few rows of the DataFrame and check for any obvious issues.
# To see what the data set looks like, we'll use the head() method.
df.head()
| symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
5 rows × 26 columns
As you can see, there are missing values in the DataFrame, often represented by question marks. Handling missing data is a crucial part of my data cleaning process.
How to work with missing data?
Steps for working with missing data:
- Identify missing data
- Deal with missing data
- Correct data format
import numpy as np
# replace "?" to NaN
df.replace("?", np.nan, inplace = True)
df.head(5)
| symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | NaN | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
5 rows × 26 columns
Evaluating for Missing Data
The missing values are converted by default. We use the following functions to identify these missing values. There are two methods to detect missing data:
- .isnull()
- .notnull()
missing_data = df.isnull()
missing_data.head(5)
| symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
5 rows × 26 columns
"True" means the value is a missing value while "False" means the value is not a missing value.
Count missing values in each column
Using a for loop in Python, we can quickly figure out the number of missing values in each column. As mentioned above, "True" represents a missing value and "False" means the value is present in the dataset. In the body of the for loop the method ".value_counts()" counts the number of "True" values.
for column in missing_data.columns.values.tolist():
print(column)
print (missing_data[column].value_counts())
print("")
symboling False 205 Name: symboling, dtype: int64 normalized-losses False 164 True 41 Name: normalized-losses, dtype: int64 make False 205 Name: make, dtype: int64 fuel-type False 205 Name: fuel-type, dtype: int64 aspiration False 205 Name: aspiration, dtype: int64 num-of-doors False 203 True 2 Name: num-of-doors, dtype: int64 body-style False 205 Name: body-style, dtype: int64 drive-wheels False 205 Name: drive-wheels, dtype: int64 engine-location False 205 Name: engine-location, dtype: int64 wheel-base False 205 Name: wheel-base, dtype: int64 length False 205 Name: length, dtype: int64 width False 205 Name: width, dtype: int64 height False 205 Name: height, dtype: int64 curb-weight False 205 Name: curb-weight, dtype: int64 engine-type False 205 Name: engine-type, dtype: int64 num-of-cylinders False 205 Name: num-of-cylinders, dtype: int64 engine-size False 205 Name: engine-size, dtype: int64 fuel-system False 205 Name: fuel-system, dtype: int64 bore False 201 True 4 Name: bore, dtype: int64 stroke False 201 True 4 Name: stroke, dtype: int64 compression-ratio False 205 Name: compression-ratio, dtype: int64 horsepower False 203 True 2 Name: horsepower, dtype: int64 peak-rpm False 203 True 2 Name: peak-rpm, dtype: int64 city-mpg False 205 Name: city-mpg, dtype: int64 highway-mpg False 205 Name: highway-mpg, dtype: int64 price False 201 True 4 Name: price, dtype: int64
Based on the summary above, each column has 205 rows of data and seven of the columns containing missing data:
- "normalized-losses": 41 missing data
- "num-of-doors": 2 missing data
- "bore": 4 missing data
- "stroke" : 4 missing data
- "horsepower": 2 missing data
- "peak-rpm": 2 missing data
- "price": 4 missing data
Deal with missing data
How to deal with missing data?- Drop data
a. Drop the whole row
b. Drop the whole column - Replace data
a. Replace it by mean
b. Replace it by frequency
c. Replace it based on other functions
Whole columns should be dropped only if most entries in the column are empty. In our dataset, none of the columns are empty enough to drop entirely. We have some freedom in choosing which method to replace data; however, some methods may seem more reasonable than others. We will apply each method to many different columns:
Replace by mean:
- "normalized-losses": 41 missing data, replace them with mean
- "stroke": 4 missing data, replace them with mean
- "bore": 4 missing data, replace them with mean
- "horsepower": 2 missing data, replace them with mean
- "peak-rpm": 2 missing data, replace them with mean
Replace by frequency:
- "num-of-doors": 2 missing data, replace them with "four".
- Reason: 84% sedans is four doors. Since four doors is most frequent, it is most likely to occur
Drop the whole row:
- "price": 4 missing data, simply delete the whole row
- Reason: price is what we want to predict. Any data entry without price data cannot be used for prediction; therefore any row now without price data is not useful to us
Calculate the mean value for the "normalized-losses" column
avg_norm_loss = df["normalized-losses"].astype("float").mean(axis=0)
print("Average of normalized-losses:", avg_norm_loss)
Average of normalized-losses: 122.0
Replace "NaN" with mean value in "normalized-losses" column
df["normalized-losses"].replace(np.nan, avg_norm_loss, inplace=True)
Calculate the mean value for the "bore" column
avg_bore=df['bore'].astype('float').mean(axis=0)
print("Average of bore:", avg_bore)
Average of bore: 3.3297512437810957
Replace "NaN" with the mean value in the "bore" column
df["bore"].replace(np.nan, avg_bore, inplace=True)
Question #1:
Based on the example above, replace NaN in "stroke" column with the mean value.
# Write your code below and press Shift+Enter to execute
avg_stroke=df['stroke'].astype('float').mean(axis=0)
print("Average of stroke:", avg_stroke)
df["stroke"].replace(np.nan, avg_stroke, inplace=True)
Average of stroke: 3.2554228855721337
Click here for the solution
#Calculate the mean vaule for "stroke" column
avg_stroke = df["stroke"].astype("float").mean(axis = 0)
print("Average of stroke:", avg_stroke)
# replace NaN by mean value in "stroke" column
df["stroke"].replace(np.nan, avg_stroke, inplace = True)
Calculate the mean value for the "horsepower" column
avg_horsepower = df['horsepower'].astype('float').mean(axis=0)
print("Average horsepower:", avg_horsepower)
Average horsepower: 104.25615763546799
Replace "NaN" with the mean value in the "horsepower" column
df['horsepower'].replace(np.nan, avg_horsepower, inplace=True)
Calculate the mean value for "peak-rpm" column
avg_peakrpm=df['peak-rpm'].astype('float').mean(axis=0)
print("Average peak rpm:", avg_peakrpm)
Average peak rpm: 5125.369458128079
Replace "NaN" with the mean value in the "peak-rpm" column
df['peak-rpm'].replace(np.nan, avg_peakrpm, inplace=True)
To see which values are present in a particular column, we can use the ".value_counts()" method:
df['num-of-doors'].value_counts()
four 114 two 89 Name: num-of-doors, dtype: int64
Mode¶
We can see that four doors are the most common type. We can also use the ".idxmax()" method to calculate the most common type automatically:
df['num-of-doors'].value_counts().idxmax()
'four'
The replacement procedure is very similar to what we have seen previously:
#replace the missing 'num-of-doors' values by the most frequent
df["num-of-doors"].replace(np.nan, "four", inplace=True)
Finally, let's drop all rows that do not have price data:
# simply drop whole row with NaN in "price" column
df.dropna(subset=["price"], axis=0, inplace=True)
# reset index, because we droped two rows
df.reset_index(drop=True, inplace=True)
df.head()
| symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122.0 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | 122.0 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | 122.0 | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
5 rows × 26 columns
Good! Now, we have a dataset with no missing values.
Correct data format
We are almost there!The last step in data cleaning is checking and making sure that all data is in the correct format (int, float, text or other).
In Pandas, we use:
.dtype() to check the data type
.astype() to change the data type
Let's list the data types for each column
df.dtypes
symboling int64 normalized-losses object make object fuel-type object aspiration object num-of-doors object body-style object drive-wheels object engine-location object wheel-base float64 length float64 width float64 height float64 curb-weight int64 engine-type object num-of-cylinders object engine-size int64 fuel-system object bore object stroke object compression-ratio float64 horsepower object peak-rpm object city-mpg int64 highway-mpg int64 price object dtype: object
As we can see above, some columns are not of the correct data type. Numerical variables should have type 'float' or 'int', and variables with strings such as categories should have type 'object'. For example, 'bore' and 'stroke' variables are numerical values that describe the engines, so we should expect them to be of the type 'float' or 'int'; however, they are shown as type 'object'. We have to convert data types into a proper format for each column using the "astype()" method.
Convert data types to proper format
df[["bore", "stroke"]] = df[["bore", "stroke"]].astype("float")
df[["normalized-losses"]] = df[["normalized-losses"]].astype("int")
df[["price"]] = df[["price"]].astype("float")
df[["peak-rpm"]] = df[["peak-rpm"]].astype("float")
Let us list the columns after the conversion
df.dtypes
symboling int64 normalized-losses int32 make object fuel-type object aspiration object num-of-doors object body-style object drive-wheels object engine-location object wheel-base float64 length float64 width float64 height float64 curb-weight int64 engine-type object num-of-cylinders object engine-size int64 fuel-system object bore float64 stroke float64 compression-ratio float64 horsepower object peak-rpm float64 city-mpg int64 highway-mpg int64 price float64 dtype: object
Wonderful!
Now we have finally obtained the cleaned dataset with no missing values with all data in its proper format.
Data Standardization
Data is usually collected from different agencies in different formats. (Data standardization is also a term for a particular type of data normalization where we subtract the mean and divide by the standard deviation.)
What is standardization?
Standardization is the process of transforming data into a common format, allowing the researcher to make the meaningful comparison.
Example
Transform mpg to L/100km:
In our dataset, the fuel consumption columns "city-mpg" and "highway-mpg" are represented by mpg (miles per gallon) unit. Assume we are developing an application in a country that accepts the fuel consumption with L/100km standard.
We will need to apply data transformation to transform mpg into L/100km.
The formula for unit conversion is:
L/100km = 235 / mpg
We can do many mathematical operations directly in Pandas.
df.head()
| symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | engine-size | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000.0 | 21 | 27 | 13495.0 |
| 1 | 3 | 122 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000.0 | 21 | 27 | 16500.0 |
| 2 | 1 | 122 | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000.0 | 19 | 26 | 16500.0 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500.0 | 24 | 30 | 13950.0 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500.0 | 18 | 22 | 17450.0 |
5 rows × 26 columns
# Convert mpg to L/100km by mathematical operation (235 divided by mpg)
df['city-L/100km'] = 235/df["city-mpg"]
# check your transformed data
df.head()
| symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | city-L/100km | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000.0 | 21 | 27 | 13495.0 | 11.190476 |
| 1 | 3 | 122 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000.0 | 21 | 27 | 16500.0 | 11.190476 |
| 2 | 1 | 122 | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000.0 | 19 | 26 | 16500.0 | 12.368421 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500.0 | 24 | 30 | 13950.0 | 9.791667 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500.0 | 18 | 22 | 17450.0 | 13.055556 |
5 rows × 27 columns
Question #2:
According to the example above, transform mpg to L/100km in the column of "highway-mpg" and change the name of column to "highway-L/100km".
# Write your code below and press Shift+Enter to execute
# transform mpg to L/100km by mathematical operation (235 divided by mpg)
df["highway-mpg"] = 235/df["highway-mpg"]
# rename column name from "highway-mpg" to "highway-L/100km"
df.rename(columns={'"highway-mpg"':'highway-L/100km'}, inplace=True)
# check your transformed data
df.head()
| symboling | normalized-losses | make | fuel-type | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | ... | fuel-system | bore | stroke | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | city-L/100km | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000.0 | 21 | 8.703704 | 13495.0 | 11.190476 |
| 1 | 3 | 122 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000.0 | 21 | 8.703704 | 16500.0 | 11.190476 |
| 2 | 1 | 122 | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000.0 | 19 | 9.038462 | 16500.0 | 12.368421 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500.0 | 24 | 7.833333 | 13950.0 | 9.791667 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500.0 | 18 | 10.681818 | 17450.0 | 13.055556 |
5 rows × 27 columns
Click here for the solution
# transform mpg to L/100km by mathematical operation (235 divided by mpg)
df["highway-mpg"] = 235/df["highway-mpg"]
# rename column name from "highway-mpg" to "highway-L/100km"
df.rename(columns={'"highway-mpg"':'highway-L/100km'}, inplace=True)
# check your transformed data
df.head()
Data Normalization
Why normalization?
Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.
Example
To demonstrate normalization, let's say we want to scale the columns "length", "width" and "height".
Target: would like to normalize those variables so their value ranges from 0 to 1
Approach: replace original value by (original value)/(maximum value)
# replace (original value) by (original value)/(maximum value)
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()
Question #3:
According to the example above, normalize the column "height".
# Write your code below and press Shift+Enter to execute
df['height'] = df['height']/df['height'].max()
Click here for the solution
df['height'] = df['height']/df['height'].max()
# show the scaled columns
df[["length","width","height"]].head()
Here we can see we've normalized "length", "width" and "height" in the range of [0,1].
Binning
Why binning?Binning is a process of transforming continuous numerical variables into discrete categorical 'bins' for grouped analysis.
Example:
In our dataset, "horsepower" is a real valued variable ranging from 48 to 288 and it has 59 unique values. What if we only care about the price difference between cars with high horsepower, medium horsepower, and little horsepower (3 types)? Can we rearrange them into three ‘bins' to simplify analysis?
We will use the pandas method 'cut' to segment the 'horsepower' column into 3 bins.
Example of Binning Data In Pandas
Convert data to correct format:
df["horsepower"]=df["horsepower"].astype(int, copy=True)
Let's plot the histogram of horsepower to see what the distribution of horsepower looks like.
%matplotlib inline
import matplotlib as plt
from matplotlib import pyplot
plt.pyplot.hist(df["horsepower"])
# set x/y labels and plot title
plt.pyplot.xlabel("horsepower")
plt.pyplot.ylabel("count")
plt.pyplot.title("horsepower bins")
Text(0.5, 1.0, 'horsepower bins')
We would like 3 bins of equal size bandwidth so we use numpy's linspace(start_value, end_value, numbers_generated function.
Since we want to include the minimum value of horsepower, we want to set start_value = min(df["horsepower"]).
Since we want to include the maximum value of horsepower, we want to set end_value = max(df["horsepower"]).
Since we are building 3 bins of equal length, there should be 4 dividers, so numbers_generated = 4.
We build a bin array with a minimum value to a maximum value by using the bandwidth calculated above. The values will determine when one bin ends and another begins.
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
bins
array([ 48. , 119.33333333, 190.66666667, 262. ])
We set group names:
group_names = ['Low', 'Medium', 'High']
We apply the function "cut" to determine what each value of df['horsepower'] belongs to.
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True )
df[['horsepower','horsepower-binned']].head(20)
| horsepower | horsepower-binned | |
|---|---|---|
| 0 | 111 | Low |
| 1 | 111 | Low |
| 2 | 154 | Medium |
| 3 | 102 | Low |
| 4 | 115 | Low |
| 5 | 110 | Low |
| 6 | 110 | Low |
| 7 | 110 | Low |
| 8 | 140 | Medium |
| 9 | 101 | Low |
| 10 | 101 | Low |
| 11 | 121 | Medium |
| 12 | 121 | Medium |
| 13 | 121 | Medium |
| 14 | 182 | Medium |
| 15 | 182 | Medium |
| 16 | 182 | Medium |
| 17 | 48 | Low |
| 18 | 70 | Low |
| 19 | 70 | Low |
Let's see the number of vehicles in each bin:
df["horsepower-binned"].value_counts()
Low 153 Medium 43 High 5 Name: horsepower-binned, dtype: int64
Let's plot the distribution of each bin:
%matplotlib inline
import matplotlib as plt
from matplotlib import pyplot
pyplot.bar(group_names, df["horsepower-binned"].value_counts())
# set x/y labels and plot title
plt.pyplot.xlabel("horsepower")
plt.pyplot.ylabel("count")
plt.pyplot.title("horsepower bins")
Text(0.5, 1.0, 'horsepower bins')
Look at the dataframe above carefully. You will find that the last column provides the bins for "horsepower" based on 3 categories ("Low", "Medium" and "High").
We successfully narrowed down the intervals from 59 to 3!
Bins Visualization
Normally, a histogram is used to visualize the distribution of bins we created above.%matplotlib inline
import matplotlib as plt
from matplotlib import pyplot
# draw historgram of attribute "horsepower" with bins = 3
plt.pyplot.hist(df["horsepower"], bins = 3)
# set x/y labels and plot title
plt.pyplot.xlabel("horsepower")
plt.pyplot.ylabel("count")
plt.pyplot.title("horsepower bins")
Text(0.5, 1.0, 'horsepower bins')
The plot above shows the binning result for the attribute "horsepower".
Indicator Variable (or Dummy Variable)
What is an indicator variable?An indicator variable (or dummy variable) is a numerical variable used to label categories. They are called 'dummies' because the numbers themselves don't have inherent meaning.
Why we use indicator variables?
We use indicator variables so we can use categorical variables for regression analysis in the later modules.
ExampleWe see the column "fuel-type" has two unique values: "gas" or "diesel". Regression doesn't understand words, only numbers. To use this attribute in regression analysis, we convert "fuel-type" to indicator variables.
We will use pandas' method 'get_dummies' to assign numerical values to different categories of fuel type.
df.columns
Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
'highway-mpg', 'price', 'city-L/100km', 'horsepower-binned'],
dtype='object')
Get the indicator variables and assign it to data frame "dummy_variable_1":
dummy_variable_1 = pd.get_dummies(df["fuel-type"])
dummy_variable_1.head()
| diesel | gas | |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 1 |
| 2 | 0 | 1 |
| 3 | 0 | 1 |
| 4 | 0 | 1 |
Change the column names for clarity:
dummy_variable_1.rename(columns={'gas':'fuel-type-gas', 'diesel':'fuel-type-diesel'}, inplace=True)
dummy_variable_1.head()
| fuel-type-diesel | fuel-type-gas | |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 0 | 1 |
| 2 | 0 | 1 |
| 3 | 0 | 1 |
| 4 | 0 | 1 |
Merge Dataframe¶
In the dataframe, column 'fuel-type' has values for 'gas' and 'diesel' as 0s and 1s now.
# merge data frame "df" and "dummy_variable_1"
df = pd.concat([df, dummy_variable_1], axis=1)
# drop original column "fuel-type" from "df"
df.drop("fuel-type", axis = 1, inplace=True)
df.head()
| symboling | normalized-losses | make | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | ... | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | city-L/100km | horsepower-binned | fuel-type-diesel | fuel-type-gas | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122 | alfa-romero | std | two | convertible | rwd | front | 88.6 | 0.811148 | ... | 9.0 | 111 | 5000.0 | 21 | 8.703704 | 13495.0 | 11.190476 | Low | 0 | 1 |
| 1 | 3 | 122 | alfa-romero | std | two | convertible | rwd | front | 88.6 | 0.811148 | ... | 9.0 | 111 | 5000.0 | 21 | 8.703704 | 16500.0 | 11.190476 | Low | 0 | 1 |
| 2 | 1 | 122 | alfa-romero | std | two | hatchback | rwd | front | 94.5 | 0.822681 | ... | 9.0 | 154 | 5000.0 | 19 | 9.038462 | 16500.0 | 12.368421 | Medium | 0 | 1 |
| 3 | 2 | 164 | audi | std | four | sedan | fwd | front | 99.8 | 0.848630 | ... | 10.0 | 102 | 5500.0 | 24 | 7.833333 | 13950.0 | 9.791667 | Low | 0 | 1 |
| 4 | 2 | 164 | audi | std | four | sedan | 4wd | front | 99.4 | 0.848630 | ... | 8.0 | 115 | 5500.0 | 18 | 10.681818 | 17450.0 | 13.055556 | Low | 0 | 1 |
5 rows × 29 columns
The last two columns are now the indicator variable representation of the fuel-type variable. They're all 0s and 1s now.
Question #4:
Similar to before, create an indicator variable for the column "aspiration"
# Write your code below and press Shift+Enter to execute
dummy_variable_2 = pd.get_dummies(df["aspiration"])
# change column names for clarity
dummy_variable_2.rename(columns={'std':'aspiration-std', 'turbo': 'aspiration-turbo'}, inplace=True)
dummy_variable_2.head()
| aspiration-std | aspiration-turbo | |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 1 | 0 |
| 2 | 1 | 0 |
| 3 | 1 | 0 |
| 4 | 1 | 0 |
Click here for the solution
# get indicator variables of aspiration and assign it to data frame "dummy_variable_2"
dummy_variable_2 = pd.get_dummies(df['aspiration'])
# change column names for clarity
dummy_variable_2.rename(columns={'std':'aspiration-std', 'turbo': 'aspiration-turbo'}, inplace=True)
# show first 5 instances of data frame "dummy_variable_1"
dummy_variable_2.head()
Question #5:
Merge the new dataframe to the original dataframe, then drop the column 'aspiration'.
# Write your code below and press Shift+Enter to execute
# merge data frame "df" and "dummy_variable_1"
df = pd.concat([df, dummy_variable_2], axis=1)
# drop original column "fuel-type" from "df"
df.drop("aspiration", axis = 1, inplace=True)
df.head()
| symboling | normalized-losses | make | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | width | ... | peak-rpm | city-mpg | highway-mpg | price | city-L/100km | horsepower-binned | fuel-type-diesel | fuel-type-gas | aspiration-std | aspiration-turbo | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122 | alfa-romero | two | convertible | rwd | front | 88.6 | 0.811148 | 0.890278 | ... | 5000.0 | 21 | 8.703704 | 13495.0 | 11.190476 | Low | 0 | 1 | 1 | 0 |
| 1 | 3 | 122 | alfa-romero | two | convertible | rwd | front | 88.6 | 0.811148 | 0.890278 | ... | 5000.0 | 21 | 8.703704 | 16500.0 | 11.190476 | Low | 0 | 1 | 1 | 0 |
| 2 | 1 | 122 | alfa-romero | two | hatchback | rwd | front | 94.5 | 0.822681 | 0.909722 | ... | 5000.0 | 19 | 9.038462 | 16500.0 | 12.368421 | Medium | 0 | 1 | 1 | 0 |
| 3 | 2 | 164 | audi | four | sedan | fwd | front | 99.8 | 0.848630 | 0.919444 | ... | 5500.0 | 24 | 7.833333 | 13950.0 | 9.791667 | Low | 0 | 1 | 1 | 0 |
| 4 | 2 | 164 | audi | four | sedan | 4wd | front | 99.4 | 0.848630 | 0.922222 | ... | 5500.0 | 18 | 10.681818 | 17450.0 | 13.055556 | Low | 0 | 1 | 1 | 0 |
5 rows × 30 columns
Click here for the solution
# merge the new dataframe to the original datafram
df = pd.concat([df, dummy_variable_2], axis=1)
# drop original column "aspiration" from "df"
df.drop('aspiration', axis = 1, inplace=True)
Save the Dataset¶
df.to_csv('clean_df.csv')
Save the new csv
This notebook and all analysis were created by Mohammad Sayem Chowdhury as a personal data science showcase.
Thank you for exploring my approach to data wrangling and cleaning! If you have any feedback or suggestions, feel free to reach out.