King County Real Estate Market Analysis: Comprehensive Property Valuation Study¶
By Mohammad Sayem Chowdhury
Last Updated: June 13, 2025
Project Overview¶
This comprehensive real estate analytics project examines housing market dynamics in King County, Washington, including the greater Seattle metropolitan area. Using advanced statistical analysis and machine learning techniques, I investigate the key factors driving property values and develop predictive models for accurate home price estimation.
Market Analysis Period¶
Data Coverage: May 2014 - May 2015
Geographic Scope: King County, Washington (including Seattle)
Sample Size: 21,613 residential property sales
Research Objectives¶
- Price Driver Analysis: Identify the most significant factors affecting residential property values
- Market Segmentation: Analyze price patterns across different neighborhoods and property types
- Predictive Modeling: Develop accurate models for property valuation and investment analysis
- Investment Insights: Provide data-driven recommendations for real estate investment strategies
Author: Mohammad Sayem Chowdhury
Project Type: Real Estate Market Analytics
Domain: Property Valuation & Investment Analysis
Methodology: Statistical Analysis & Predictive Modeling
Table of Contents¶
- Market Data Overview & Import
- Property Feature Analysis
- Geographic Price Distribution
- Predictive Model Development
- Investment Insights & Recommendations
Executive Summary¶
King County Real Estate Market Dynamics¶
This analysis focuses on understanding the complex factors that drive residential property values in one of the most dynamic real estate markets in the United States. King County, home to Seattle and major technology companies, presents unique opportunities for data-driven real estate insights.
Key Research Questions¶
- Which property characteristics have the strongest correlation with sale prices?
- How do location factors (geography, proximity to amenities) influence valuation?
- What predictive models can most accurately estimate property values?
- Which areas present the best investment opportunities based on historical data?
Market Context: The 2014-2015 period represents a critical time in the Seattle real estate market, with significant growth driven by the technology sector expansion.
Dataset Features¶
- id: Unique identifier for a house
- date: Date house was sold
- price: Sale price (target variable)
- bedrooms: Number of bedrooms
- bathrooms: Number of bathrooms
- sqft_living: Living area square footage
- sqft_lot: Lot size in square feet
- floors: Number of floors
- waterfront: Whether the house has a waterfront view
- view: An index from 0 to 4 rating how good the view from the property is
- condition: Overall condition rating
- grade: Overall grade based on King County grading system
- sqft_above: Square footage above ground
- sqft_basement: Basement square footage
- yr_built: Year built
- yr_renovated: Year renovated
- zipcode: Zip code
- lat: Latitude
- long: Longitude
Notebook by Mohammad Sayem Chowdhury
You will require the following libraries:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler,PolynomialFeatures
from sklearn.linear_model import LinearRegression
%matplotlib inline
Module 1: Importing Data Sets¶
Load the csv:
file_name='https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/coursera/project/kc_house_data_NaN.csv'
df=pd.read_csv(file_name)
We use the method head to display the first 5 rows of the dataframe.
df.head()
| Unnamed: 0 | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 7129300520 | 20141013T000000 | 221900.0 | 3.0 | 1.00 | 1180 | 5650 | 1.0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 1 | 6414100192 | 20141209T000000 | 538000.0 | 3.0 | 2.25 | 2570 | 7242 | 2.0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.7210 | -122.319 | 1690 | 7639 |
| 2 | 2 | 5631500400 | 20150225T000000 | 180000.0 | 2.0 | 1.00 | 770 | 10000 | 1.0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 3 | 2487200875 | 20141209T000000 | 604000.0 | 4.0 | 3.00 | 1960 | 5000 | 1.0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 4 | 1954400510 | 20150218T000000 | 510000.0 | 3.0 | 2.00 | 1680 | 8080 | 1.0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |
5 rows × 22 columns
Question 1¶
Display the data types of each column using the attribute dtypes, then take a screenshot and submit it; include your code in the image.
print(df.dtypes)
Unnamed: 0         int64
id                 int64
date              object
price            float64
bedrooms         float64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object
We use the method describe to obtain a statistical summary of the dataframe.
df.describe()
| Unnamed: 0 | id | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 21613.00000 | 2.161300e+04 | 2.161300e+04 | 21600.000000 | 21603.000000 | 21613.000000 | 2.161300e+04 | 21613.000000 | 21613.000000 | 21613.000000 | ... | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 |
| mean | 10806.00000 | 4.580302e+09 | 5.400881e+05 | 3.372870 | 2.115736 | 2079.899736 | 1.510697e+04 | 1.494309 | 0.007542 | 0.234303 | ... | 7.656873 | 1788.390691 | 291.509045 | 1971.005136 | 84.402258 | 98077.939805 | 47.560053 | -122.213896 | 1986.552492 | 12768.455652 |
| std | 6239.28002 | 2.876566e+09 | 3.671272e+05 | 0.926657 | 0.768996 | 918.440897 | 4.142051e+04 | 0.539989 | 0.086517 | 0.766318 | ... | 1.175459 | 828.090978 | 442.575043 | 29.373411 | 401.679240 | 53.505026 | 0.138564 | 0.140828 | 685.391304 | 27304.179631 |
| min | 0.00000 | 1.000102e+06 | 7.500000e+04 | 1.000000 | 0.500000 | 290.000000 | 5.200000e+02 | 1.000000 | 0.000000 | 0.000000 | ... | 1.000000 | 290.000000 | 0.000000 | 1900.000000 | 0.000000 | 98001.000000 | 47.155900 | -122.519000 | 399.000000 | 651.000000 |
| 25% | 5403.00000 | 2.123049e+09 | 3.219500e+05 | 3.000000 | 1.750000 | 1427.000000 | 5.040000e+03 | 1.000000 | 0.000000 | 0.000000 | ... | 7.000000 | 1190.000000 | 0.000000 | 1951.000000 | 0.000000 | 98033.000000 | 47.471000 | -122.328000 | 1490.000000 | 5100.000000 |
| 50% | 10806.00000 | 3.904930e+09 | 4.500000e+05 | 3.000000 | 2.250000 | 1910.000000 | 7.618000e+03 | 1.500000 | 0.000000 | 0.000000 | ... | 7.000000 | 1560.000000 | 0.000000 | 1975.000000 | 0.000000 | 98065.000000 | 47.571800 | -122.230000 | 1840.000000 | 7620.000000 |
| 75% | 16209.00000 | 7.308900e+09 | 6.450000e+05 | 4.000000 | 2.500000 | 2550.000000 | 1.068800e+04 | 2.000000 | 0.000000 | 0.000000 | ... | 8.000000 | 2210.000000 | 560.000000 | 1997.000000 | 0.000000 | 98118.000000 | 47.678000 | -122.125000 | 2360.000000 | 10083.000000 |
| max | 21612.00000 | 9.900000e+09 | 7.700000e+06 | 33.000000 | 8.000000 | 13540.000000 | 1.651359e+06 | 3.500000 | 1.000000 | 4.000000 | ... | 13.000000 | 9410.000000 | 4820.000000 | 2015.000000 | 2015.000000 | 98199.000000 | 47.777600 | -121.315000 | 6210.000000 | 871200.000000 |
8 rows × 21 columns
Module 2: Data Wrangling¶
Question 2¶
Drop the columns "id" and "Unnamed: 0" using the method drop() with axis=1, then use the method describe() to obtain a statistical summary of the data. Take a screenshot and submit it; make sure the inplace parameter is set to True.
df.drop(['id','Unnamed: 0'],axis=1,inplace=True)
df.describe(include = "all")
| date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 21613 | 2.161300e+04 | 21600.000000 | 21603.000000 | 21613.000000 | 2.161300e+04 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 | 21613.000000 |
| unique | 372 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | 20140623T000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | 142 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | 5.400881e+05 | 3.372870 | 2.115736 | 2079.899736 | 1.510697e+04 | 1.494309 | 0.007542 | 0.234303 | 3.409430 | 7.656873 | 1788.390691 | 291.509045 | 1971.005136 | 84.402258 | 98077.939805 | 47.560053 | -122.213896 | 1986.552492 | 12768.455652 |
| std | NaN | 3.671272e+05 | 0.926657 | 0.768996 | 918.440897 | 4.142051e+04 | 0.539989 | 0.086517 | 0.766318 | 0.650743 | 1.175459 | 828.090978 | 442.575043 | 29.373411 | 401.679240 | 53.505026 | 0.138564 | 0.140828 | 685.391304 | 27304.179631 |
| min | NaN | 7.500000e+04 | 1.000000 | 0.500000 | 290.000000 | 5.200000e+02 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 290.000000 | 0.000000 | 1900.000000 | 0.000000 | 98001.000000 | 47.155900 | -122.519000 | 399.000000 | 651.000000 |
| 25% | NaN | 3.219500e+05 | 3.000000 | 1.750000 | 1427.000000 | 5.040000e+03 | 1.000000 | 0.000000 | 0.000000 | 3.000000 | 7.000000 | 1190.000000 | 0.000000 | 1951.000000 | 0.000000 | 98033.000000 | 47.471000 | -122.328000 | 1490.000000 | 5100.000000 |
| 50% | NaN | 4.500000e+05 | 3.000000 | 2.250000 | 1910.000000 | 7.618000e+03 | 1.500000 | 0.000000 | 0.000000 | 3.000000 | 7.000000 | 1560.000000 | 0.000000 | 1975.000000 | 0.000000 | 98065.000000 | 47.571800 | -122.230000 | 1840.000000 | 7620.000000 |
| 75% | NaN | 6.450000e+05 | 4.000000 | 2.500000 | 2550.000000 | 1.068800e+04 | 2.000000 | 0.000000 | 0.000000 | 4.000000 | 8.000000 | 2210.000000 | 560.000000 | 1997.000000 | 0.000000 | 98118.000000 | 47.678000 | -122.125000 | 2360.000000 | 10083.000000 |
| max | NaN | 7.700000e+06 | 33.000000 | 8.000000 | 13540.000000 | 1.651359e+06 | 3.500000 | 1.000000 | 4.000000 | 5.000000 | 13.000000 | 9410.000000 | 4820.000000 | 2015.000000 | 2015.000000 | 98199.000000 | 47.777600 | -121.315000 | 6210.000000 | 871200.000000 |
We can see that the columns bedrooms and bathrooms have missing values.
print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())
number of NaN values for the column bedrooms : 13
number of NaN values for the column bathrooms : 10
We can replace the missing values of the column 'bedrooms' with the mean of the column 'bedrooms' using the method replace(). Don't forget to set the inplace parameter to True
mean=df['bedrooms'].mean()
df['bedrooms'].replace(np.nan,mean, inplace=True)
We also replace the missing values of the column 'bathrooms' with the mean of the column 'bathrooms' using the method replace(). Don't forget to set the inplace parameter to True
mean=df['bathrooms'].mean()
df['bathrooms'].replace(np.nan,mean, inplace=True)
print("number of NaN values for the column bedrooms :", df['bedrooms'].isnull().sum())
print("number of NaN values for the column bathrooms :", df['bathrooms'].isnull().sum())
number of NaN values for the column bedrooms : 0
number of NaN values for the column bathrooms : 0
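The same mean-imputation pattern can be sketched on a small synthetic frame (the data below is illustrative, not drawn from the King County set):

```python
import numpy as np
import pandas as pd

# Toy column with two missing bedroom counts (illustrative values only)
toy = pd.DataFrame({"bedrooms": [2.0, np.nan, 4.0, np.nan, 3.0]})

# Mean of the non-missing values: (2 + 4 + 3) / 3 = 3.0
mean = toy["bedrooms"].mean()

# Replace NaN with the column mean, as done for 'bedrooms' and 'bathrooms' above
toy["bedrooms"] = toy["bedrooms"].replace(np.nan, mean)

print(toy["bedrooms"].isnull().sum())  # → 0
```

Assigning the result back to the column avoids the chained-assignment pitfalls of calling replace() with inplace=True on a single column in recent pandas versions.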
Module 3: Exploratory Data Analysis¶
Question 3¶
Use the method value_counts to count the number of houses with unique floor values, use the method .to_frame() to convert it to a dataframe.
df['floors'].value_counts()
1.0    10680
2.0     8241
1.5     1910
3.0      613
2.5      161
3.5        8
Name: floors, dtype: int64
df['floors'].value_counts().to_frame()
| floors | |
|---|---|
| 1.0 | 10680 |
| 2.0 | 8241 |
| 1.5 | 1910 |
| 3.0 | 613 |
| 2.5 | 161 |
| 3.5 | 8 |
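The value_counts → to_frame chain can be demonstrated on a toy series (values illustrative):

```python
import pandas as pd

# Small illustrative 'floors' series
floors = pd.Series([1.0, 2.0, 1.0, 1.5, 1.0], name="floors")

# Count each unique value, then convert the result to a one-column dataframe
counts = floors.value_counts().to_frame()
print(counts)
```

As in the table above, the most frequent value sorts first: here 1.0, which appears 3 times.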
Question 4¶
Use the function boxplot in the seaborn library to determine whether houses with a waterfront view or without a waterfront view have more price outliers.
sns.boxplot(x="waterfront", y="price", data=df)
(Boxplot output: price distributions for houses with and without a waterfront view.)
Question 5¶
Use the function regplot in the seaborn library to determine if the feature sqft_above is negatively or positively correlated with price.
sns.regplot(x="sqft_above", y="price", data=df)
(Regression plot output: price rises with sqft_above, indicating a positive correlation.)
We can use the Pandas method corr() to find the feature other than price that is most correlated with price.
df.corr()['price'].sort_values()
zipcode         -0.053203
long             0.021626
condition        0.036362
yr_built         0.054012
sqft_lot15       0.082447
sqft_lot         0.089661
yr_renovated     0.126434
floors           0.256794
waterfront       0.266369
lat              0.307003
bedrooms         0.308797
sqft_basement    0.323816
view             0.397293
bathrooms        0.525738
sqft_living15    0.585379
sqft_above       0.605567
grade            0.667434
sqft_living      0.702035
price            1.000000
Name: price, dtype: float64
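The ranking above comes from sorting the 'price' column of the correlation matrix; a minimal sketch on synthetic columns (names and values are illustrative):

```python
import pandas as pd

# Toy frame: 'sqft' roughly tracks price, 'age' moves against it
toy = pd.DataFrame({
    "price": [100, 200, 300, 400],
    "sqft":  [1100, 1900, 3200, 3800],
    "age":   [40, 30, 20, 10],
})

# Sort features by their correlation with the target, weakest to strongest
ranking = toy.corr()["price"].sort_values()
print(ranking.index[-2])  # most correlated feature other than price itself
```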
Module 4: Model Development¶
We can fit a linear regression model using the longitude feature 'long' and calculate the R^2.
X = df[['long']]
Y = df['price']
lm = LinearRegression()
lm.fit(X,Y)
lm.score(X, Y)
0.00046769430149007363
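lm.score returns the coefficient of determination R^2, identical to applying r2_score to the model's predictions. A quick check on synthetic data (the linear relationship here is illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)  # noisy linear signal

lm = LinearRegression().fit(X, y)

# score() and r2_score() agree on the same data
assert np.isclose(lm.score(X, y), r2_score(y, lm.predict(X)))
```

An R^2 near zero, as with 'long' above, means the single feature explains almost none of the price variance.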
Question 6¶
Fit a linear regression model to predict the 'price' using the feature 'sqft_living' then calculate the R^2. Take a screenshot of your code and the value of the R^2.
X = df[['sqft_living']]
Y = df['price']
lm1 = LinearRegression()
lm1.fit(X,Y)
lm1.score(X, Y)
0.4928532179037931
Question 7¶
Fit a linear regression model to predict the 'price' using the list of features:
features =["floors", "waterfront","lat" ,"bedrooms" ,"sqft_basement" ,"view" ,"bathrooms","sqft_living15","sqft_above","grade","sqft_living"]
Then calculate the R^2. Take a screenshot of your code.
X = df[features]
Y = df['price']
lm2 = LinearRegression()
lm2.fit(X, Y)
lm2.score(X, Y)
0.657679183672129
This will help with Question 8¶
Create a list of tuples, where the first element of each tuple is the name of the estimator ('scale', 'polynomial', 'model') and the second element is the corresponding model constructor (StandardScaler(), PolynomialFeatures(include_bias=False), LinearRegression()):
Input=[('scale',StandardScaler()),('polynomial', PolynomialFeatures(include_bias=False)),('model',LinearRegression())]
Question 8¶
Use the list to create a pipeline object to predict the 'price', fit the object using the features in the list features, and calculate the R^2.
pipe=Pipeline(Input)
pipe.fit(X,Y)
pipe.score(X,Y)
0.7513408553309376
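The jump from 0.658 to 0.751 comes from the quadratic terms added by PolynomialFeatures. The same three-step pipeline can be exercised end-to-end on synthetic data where the target genuinely contains a squared term (data and coefficients are illustrative):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=50)

# Same step list as above: scale, expand to degree-2 terms, then fit
pipe = Pipeline([("scale", StandardScaler()),
                 ("polynomial", PolynomialFeatures(include_bias=False)),
                 ("model", LinearRegression())])
pipe.fit(X, y)

# The expansion supplies the X[:, 0]**2 column a plain linear model lacks
print(round(pipe.score(X, y), 3))
```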
Module 5: Model Evaluation and Refinement¶
Import the necessary modules:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
print("done")
done
We will split the data into training and testing sets:
features =["floors", "waterfront","lat" ,"bedrooms" ,"sqft_basement" ,"view" ,"bathrooms","sqft_living15","sqft_above","grade","sqft_living"]
X = df[features]
Y = df['price']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=1)
print("number of test samples:", x_test.shape[0])
print("number of training samples:",x_train.shape[0])
number of test samples: 3242
number of training samples: 18371
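The split sizes follow from test_size=0.15: scikit-learn takes the ceiling of 0.15 × 21,613 = 3,241.95, giving 3,242 test rows and 18,371 training rows. A stand-in check (the arrays here are placeholders, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same row count as the King County data
X = np.arange(21613).reshape(-1, 1)
y = np.arange(21613)

x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=1)
print(x_te.shape[0], x_tr.shape[0])  # → 3242 18371
```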
Question 9¶
Create and fit a Ridge regression object using the training data, set the regularization parameter to 0.1, and calculate the R^2 using the test data.
from sklearn.linear_model import Ridge
RidgeModel = Ridge(alpha=0.1)
RidgeModel.fit(x_train, y_train)
RidgeModel.score(x_test, y_test)
0.6478759163939122
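Here alpha=0.1 is simply given; in practice one would compare candidate strengths with cross-validation before committing to a value. A hedged sketch on synthetic data (the alpha grid and the data are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Mean 4-fold R^2 for each candidate regularization strength
scores = {a: cross_val_score(Ridge(alpha=a), X, y, cv=4).mean()
          for a in [0.01, 0.1, 1.0, 10.0]}
best_alpha = max(scores, key=scores.get)
```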
Question 10¶
Perform a second order polynomial transform on both the training data and testing data. Create and fit a Ridge regression object using the training data, set the regularization parameter to 0.1, and calculate the R^2 using the test data. Take a screenshot of your code and the R^2.
pr1 = PolynomialFeatures(degree=2)
x_train_pr1 = pr1.fit_transform(x_train)
x_test_pr1 = pr1.transform(x_test)  # transform only: the transformer is fit on training data, never on test data
RidgeModel = Ridge(alpha=0.1)
RidgeModel.fit(x_train_pr1, y_train)
RidgeModel.score(x_test_pr1, y_test)
0.7002744279896707
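For reference, a degree-2 expansion of the 11 features used above produces 78 columns: 1 bias term, 11 linear terms, and 66 quadratic terms (11 squares plus C(11,2) = 55 cross products), which is why the ridge penalty matters more after the transform. A quick shape check on a placeholder array:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.zeros((4, 11))              # placeholder: 11 features, as in the list above
pr = PolynomialFeatures(degree=2)  # include_bias=True by default
X2 = pr.fit_transform(X)
print(X2.shape[1])  # → 78 (1 bias + 11 linear + 66 quadratic)
```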