Density-Based Clustering with DBSCAN¶
This notebook is organized by key machine learning topics, progressing from basic to advanced, and is designed for hands-on experimentation and project work. You are encouraged to explore, modify, and extend each section.
01. Data Loading & Exploration¶
Load the dataset and perform initial exploration. Understand the data structure and basic statistics.
Let me import the libraries I'll need for this density-based clustering analysis:
- numpy for numerical operations
- DBSCAN from sklearn.cluster
- make_blobs for generating sample data
- StandardScaler for data preprocessing
- matplotlib for visualization
Remember to use %matplotlib inline to display plots
# Notice: For map visualization, you need the basemap package.
# If you don't have basemap installed on your machine, you can use the following line to install it:
# !conda install -c conda-forge basemap matplotlib==3.1 -y
# Notice: you might have to refresh your page and re-run the notebook after installation.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
%matplotlib inline
Generating Sample Data¶
I'll create some sample data to experiment with DBSCAN. The function below generates clusters with specified characteristics so I can understand how the algorithm works.
The function requires these inputs:
- centroidLocation: Coordinates of the centroids that will generate the random data.
- Example: input: [[4,3], [2,-1], [-1,4]]
- numSamples: The number of data points we want generated, split over the number of centroids (# of centroids defined in centroidLocation)
- Example: 1500
- clusterDeviation: The standard deviation of each generated cluster. The larger the number, the more spread out the points are around their centroid.
- Example: 0.5
def createDataPoints(centroidLocation, numSamples, clusterDeviation):
    # Create random data and store in feature matrix X and response vector y.
    X, y = make_blobs(n_samples=numSamples, centers=centroidLocation,
                      cluster_std=clusterDeviation)
    # Standardize features by removing the mean and scaling to unit variance
    X = StandardScaler().fit_transform(X)
    return X, y
Now I'll generate sample data with 3 centroids to see how DBSCAN identifies clusters.
X, y = createDataPoints([[4, 3], [2, -1], [-1, 4]], 1500, 0.5)
My DBSCAN Analysis¶
DBSCAN is fascinating because it doesn't require specifying the number of clusters beforehand. Instead, it finds dense regions based on two key parameters:
- Epsilon (eps): The radius within which to search for neighbors
- min_samples: The minimum number of points needed to form a dense region
I find this approach particularly useful for datasets with irregular cluster shapes or noise.
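Choosing eps is often the hard part. A common heuristic, sketched below under the assumption that min_samples will be 7 (the value I use next), is to plot each point's sorted distance to its k-th nearest neighbor and look for the elbow of the curve:
from sklearn.neighbors import NearestNeighbors

k = 7  # the min_samples value we plan to use
nbrs = NearestNeighbors(n_neighbors=k).fit(X)
distances, indices = nbrs.kneighbors(X)

# Sort each point's distance to its k-th neighbor; the "elbow" hints at a reasonable eps
plt.plot(np.sort(distances[:, k - 1]))
plt.xlabel('Points sorted by k-distance')
plt.ylabel('Distance to 7th nearest neighbor')
plt.show()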
epsilon = 0.3
minimumSamples = 7
db = DBSCAN(eps=epsilon, min_samples=minimumSamples).fit(X)
labels = db.labels_
labels
array([0, 1, 1, ..., 2, 2, 1], dtype=int64)
Identifying Core Points vs Outliers¶
One thing I appreciate about DBSCAN is how it naturally identifies outliers. Let me distinguish between core points (part of clusters) and noise points.
# First, create an array of booleans using the labels from db.
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
core_samples_mask
array([ True, True, True, ..., True, True, True])
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_clusters_
3
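It's also easy to count how many points were labeled as noise:
# Count the points DBSCAN labeled as noise (-1)
n_noise_ = list(labels).count(-1)
n_noise_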
# Remove repetition in labels by turning it into a set.
unique_labels = set(labels)
unique_labels
{0, 1, 2}
Visualizing the Results¶
Now I'll create a visualization to see how well DBSCAN identified the clusters. This helps me understand the algorithm's effectiveness.
# Create colors for the clusters.
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
# Plot the points with colors
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'

    class_member_mask = (labels == k)

    # Plot the core points of the cluster
    xy = X[class_member_mask & core_samples_mask]
    plt.scatter(xy[:, 0], xy[:, 1], s=50, c=[col], marker=u'o', alpha=0.5)

    # Plot the non-core points (border points and outliers)
    xy = X[class_member_mask & ~core_samples_mask]
    plt.scatter(xy[:, 0], xy[:, 1], s=50, c=[col], marker=u'o', alpha=0.5)
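Both scatter calls above use identical styling, so core and non-core points look the same in the plot. One way to make the density structure visible, as a small sketch building on the same masks, is to draw core points with larger markers:
# Re-plot with core points larger than border/noise points
for k, col in zip(unique_labels, colors):
    if k == -1:
        col = 'k'  # black for noise
    class_member_mask = (labels == k)
    # Core points: large markers
    xy = X[class_member_mask & core_samples_mask]
    plt.scatter(xy[:, 0], xy[:, 1], s=80, c=[col], marker='o', alpha=0.5)
    # Border and noise points: small markers
    xy = X[class_member_mask & ~core_samples_mask]
    plt.scatter(xy[:, 0], xy[:, 1], s=20, c=[col], marker='o', alpha=0.5)
plt.show()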
Practice¶
To better understand the differences between partition-based and density-based clustering, try to cluster the above dataset into 3 clusters using k-Means.
Notice: do not generate the data again; use the same dataset as above.
# write your code here
from sklearn.cluster import KMeans

k = 3
k_means = KMeans(init="k-means++", n_clusters=k, n_init=12)
k_means.fit(X)
fig = plt.figure(figsize=(6, 4))
ax = fig.add_subplot(1, 1, 1)
for i, col in zip(range(k), colors):
    my_members = (k_means.labels_ == i)
    plt.scatter(X[my_members, 0], X[my_members, 1], color=col, marker=u'o', alpha=0.5)
plt.show()
Solution:
from sklearn.cluster import KMeans

k = 3
k_means3 = KMeans(init="k-means++", n_clusters=k, n_init=12)
k_means3.fit(X)
fig = plt.figure(figsize=(6, 4))
ax = fig.add_subplot(1, 1, 1)
for i, col in zip(range(k), colors):
    my_members = (k_means3.labels_ == i)
    plt.scatter(X[my_members, 0], X[my_members, 1], color=col, marker=u'o', alpha=0.5)
plt.show()
Weather Station Clustering using DBSCAN & scikit-learn
DBSCAN is especially good at tasks like class identification in a spatial context. A valuable property of the DBSCAN algorithm is that it can find clusters of arbitrary shape without being thrown off by noise. The following example clusters the locations of weather stations in Canada.
DBSCAN can be used here, for instance, to find groups of stations that show similar weather conditions. It not only finds clusters of different arbitrary shapes, but also identifies the denser regions of samples while ignoring less-dense areas and noise.
Let's start working with the data. We will follow this workflow:
- Loading data
- Overview of the data
- Data cleaning
- Data selection
- Clustering
About the dataset¶
Environment Canada Monthly Values for July - 2015
| Name in the table | Meaning |
|---|---|
| Stn_Name | Station Name |
| Lat | Latitude (North+, degrees) |
| Long | Longitude (West - , degrees) |
| Prov | Province |
| Tm | Mean Temperature (°C) |
| DwTm | Days without Valid Mean Temperature |
| D | Mean Temperature difference from Normal (1981-2010) (°C) |
| Tx | Highest Monthly Maximum Temperature (°C) |
| DwTx | Days without Valid Maximum Temperature |
| Tn | Lowest Monthly Minimum Temperature (°C) |
| DwTn | Days without Valid Minimum Temperature |
| S | Snowfall (cm) |
| DwS | Days without Valid Snowfall |
| S%N | Percent of Normal (1981-2010) Snowfall |
| P | Total Precipitation (mm) |
| DwP | Days without Valid Precipitation |
| P%N | Percent of Normal (1981-2010) Precipitation |
| S_G | Snow on the ground at the end of the month (cm) |
| Pd | Number of days with Precipitation 1.0 mm or more |
| BS | Bright Sunshine (hours) |
| DwBS | Days without Valid Bright Sunshine |
| BS% | Percent of Normal (1981-2010) Bright Sunshine |
| HDD | Degree Days below 18 °C |
| CDD | Degree Days above 18 °C |
| Stn_No | Climate station identifier (first 3 digits indicate drainage basin, last 4 characters are for sorting alphabetically). |
| NA | Not Available |
1- Download data¶
To download the data, we will use !wget to fetch it from IBM Object Storage.
path ="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%204/data/weather-stations20140101-20141231.csv"
2- Load the dataset¶
We will read the .csv file into a pandas DataFrame.
import pandas as pd
import numpy as np

# Read the csv
pdf = pd.read_csv(path)
pdf.head(5)
| Stn_Name | Lat | Long | Prov | Tm | DwTm | D | Tx | DwTx | Tn | ... | DwP | P%N | S_G | Pd | BS | DwBS | BS% | HDD | CDD | Stn_No | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CHEMAINUS | 48.935 | -123.742 | BC | 8.2 | 0.0 | NaN | 13.5 | 0.0 | 1.0 | ... | 0.0 | NaN | 0.0 | 12.0 | NaN | NaN | NaN | 273.3 | 0.0 | 1011500 |
| 1 | COWICHAN LAKE FORESTRY | 48.824 | -124.133 | BC | 7.0 | 0.0 | 3.0 | 15.0 | 0.0 | -3.0 | ... | 0.0 | 104.0 | 0.0 | 12.0 | NaN | NaN | NaN | 307.0 | 0.0 | 1012040 |
| 2 | LAKE COWICHAN | 48.829 | -124.052 | BC | 6.8 | 13.0 | 2.8 | 16.0 | 9.0 | -2.5 | ... | 9.0 | NaN | NaN | 11.0 | NaN | NaN | NaN | 168.1 | 0.0 | 1012055 |
| 3 | DISCOVERY ISLAND | 48.425 | -123.226 | BC | NaN | NaN | NaN | 12.5 | 0.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1012475 |
| 4 | DUNCAN KELVIN CREEK | 48.735 | -123.728 | BC | 7.7 | 2.0 | 3.4 | 14.5 | 2.0 | -1.0 | ... | 2.0 | NaN | NaN | 11.0 | NaN | NaN | NaN | 267.7 | 0.0 | 1012573 |
5 rows × 25 columns
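Before cleaning, it's worth a quick look at the dataset's size and the distribution of the mean-temperature field, since missing Tm values are what we clean next:
# Overall shape and summary statistics for the mean temperature
print(pdf.shape)
pdf['Tm'].describe()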
3- Cleaning¶
Let's remove rows that don't have a value in the Tm field.
pdf = pdf[pd.notnull(pdf["Tm"])]
pdf = pdf.reset_index(drop=True)
pdf.head(5)
| Stn_Name | Lat | Long | Prov | Tm | DwTm | D | Tx | DwTx | Tn | ... | DwP | P%N | S_G | Pd | BS | DwBS | BS% | HDD | CDD | Stn_No | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CHEMAINUS | 48.935 | -123.742 | BC | 8.2 | 0.0 | NaN | 13.5 | 0.0 | 1.0 | ... | 0.0 | NaN | 0.0 | 12.0 | NaN | NaN | NaN | 273.3 | 0.0 | 1011500 |
| 1 | COWICHAN LAKE FORESTRY | 48.824 | -124.133 | BC | 7.0 | 0.0 | 3.0 | 15.0 | 0.0 | -3.0 | ... | 0.0 | 104.0 | 0.0 | 12.0 | NaN | NaN | NaN | 307.0 | 0.0 | 1012040 |
| 2 | LAKE COWICHAN | 48.829 | -124.052 | BC | 6.8 | 13.0 | 2.8 | 16.0 | 9.0 | -2.5 | ... | 9.0 | NaN | NaN | 11.0 | NaN | NaN | NaN | 168.1 | 0.0 | 1012055 |
| 3 | DUNCAN KELVIN CREEK | 48.735 | -123.728 | BC | 7.7 | 2.0 | 3.4 | 14.5 | 2.0 | -1.0 | ... | 2.0 | NaN | NaN | 11.0 | NaN | NaN | NaN | 267.7 | 0.0 | 1012573 |
| 4 | ESQUIMALT HARBOUR | 48.432 | -123.439 | BC | 8.8 | 0.0 | NaN | 13.1 | 0.0 | 1.9 | ... | 8.0 | NaN | NaN | 12.0 | NaN | NaN | NaN | 258.6 | 0.0 | 1012710 |
5 rows × 25 columns
4- Visualization¶
Visualization of stations on a map using the basemap package. The matplotlib basemap toolkit is a library for plotting 2D data on maps in Python. Basemap does not do any plotting on its own, but provides the facilities to transform coordinates to a map projection.
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = (14,10)
llon = -140
ulon = -50
llat = 40
ulat = 65

pdf = pdf[(pdf['Long'] > llon) & (pdf['Long'] < ulon) & (pdf['Lat'] > llat) & (pdf['Lat'] < ulat)]

my_map = Basemap(projection='merc',
                 resolution='l', area_thresh=1000.0,
                 llcrnrlon=llon, llcrnrlat=llat,  # min longitude (llcrnrlon) and latitude (llcrnrlat)
                 urcrnrlon=ulon, urcrnrlat=ulat)  # max longitude (urcrnrlon) and latitude (urcrnrlat)

my_map.drawcoastlines()
my_map.drawcountries()
# my_map.drawmapboundary()
my_map.fillcontinents(color='white', alpha=0.3)
my_map.shadedrelief()

# Project station longitude/latitude onto map coordinates
xs, ys = my_map(np.asarray(pdf.Long), np.asarray(pdf.Lat))
pdf['xm'] = xs.tolist()
pdf['ym'] = ys.tolist()

# Visualization 1: plot each station on the map
for index, row in pdf.iterrows():
    # x, y = my_map(row.Long, row.Lat)
    my_map.plot(row.xm, row.ym, markerfacecolor=([1, 0, 0]), marker='o', markersize=5, alpha=0.75)
    # plt.text(x, y, stn)
plt.show()
5- Clustering of stations based on their location i.e. Lat & Lon¶
DBSCAN from the sklearn library performs DBSCAN clustering on a vector array or distance matrix. In our case, we pass it the NumPy array Clus_dataSet to find core samples of high density and expand clusters from them.
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler
sklearn.utils.check_random_state(1000)
Clus_dataSet = pdf[['xm','ym']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)
# Compute DBSCAN
db = DBSCAN(eps=0.15, min_samples=10).fit(Clus_dataSet)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
pdf["Clus_Db"]=labels
realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels))
# A sample of clusters
pdf[["Stn_Name","Tx","Tm","Clus_Db"]].head(5)
| Stn_Name | Tx | Tm | Clus_Db | |
|---|---|---|---|---|
| 0 | CHEMAINUS | 13.5 | 8.2 | 0 |
| 1 | COWICHAN LAKE FORESTRY | 15.0 | 7.0 | 0 |
| 2 | LAKE COWICHAN | 16.0 | 6.8 | 0 |
| 3 | DUNCAN KELVIN CREEK | 14.5 | 7.7 | 0 |
| 4 | ESQUIMALT HARBOUR | 13.1 | 8.8 | 0 |
As you can see, outliers are assigned the cluster label -1:
set(labels)
{-1, 0, 1, 2, 3, 4}
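A quick way to see how many stations landed in each cluster, with -1 counting the noise points:
# Station count per cluster label
pdf["Clus_Db"].value_counts()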
6- Visualization of clusters based on location¶
Now, we can visualize the clusters using basemap:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = (14,10)
my_map = Basemap(projection='merc',
                 resolution='l', area_thresh=1000.0,
                 llcrnrlon=llon, llcrnrlat=llat,  # min longitude (llcrnrlon) and latitude (llcrnrlat)
                 urcrnrlon=ulon, urcrnrlat=ulat)  # max longitude (urcrnrlon) and latitude (urcrnrlat)

my_map.drawcoastlines()
my_map.drawcountries()
# my_map.drawmapboundary()
my_map.fillcontinents(color='white', alpha=0.3)
my_map.shadedrelief()

# To create a color map
colors = plt.get_cmap('jet')(np.linspace(0.0, 1.0, clusterNum))

# Visualization 1: plot clusters, with noise in gray
for clust_number in set(labels):
    c = ([0.4, 0.4, 0.4] if clust_number == -1 else colors[int(clust_number)])
    clust_set = pdf[pdf.Clus_Db == clust_number]
    my_map.scatter(clust_set.xm, clust_set.ym, color=c, marker='o', s=20, alpha=0.85)
    if clust_number != -1:
        cenx = np.mean(clust_set.xm)
        ceny = np.mean(clust_set.ym)
        plt.text(cenx, ceny, str(clust_number), fontsize=35, color='red')
        print("Cluster " + str(clust_number) + ', Avg Temp: ' + str(np.mean(clust_set.Tm)))
Cluster 0, Avg Temp: 6.2211920529801334
Cluster 1, Avg Temp: 6.790000000000001
Cluster 2, Avg Temp: -0.49411764705882355
Cluster 3, Avg Temp: -13.877209302325586
Cluster 4, Avg Temp: -4.186274509803922
Cluster 5, Avg Temp: -16.301503759398482
Cluster 6, Avg Temp: -13.599999999999998
Cluster 7, Avg Temp: -9.753333333333334
Cluster 8, Avg Temp: -4.258333333333334
7- Clustering of stations based on their location, mean, max, and min Temperature¶
In this section we re-run DBSCAN, but this time on a 5-dimensional dataset:
from sklearn.cluster import DBSCAN
import sklearn.utils
from sklearn.preprocessing import StandardScaler
sklearn.utils.check_random_state(1000)
Clus_dataSet = pdf[['xm','ym','Tx','Tm','Tn']]
Clus_dataSet = np.nan_to_num(Clus_dataSet)
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(Clus_dataSet)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
pdf["Clus_Db"]=labels
realClusterNum=len(set(labels)) - (1 if -1 in labels else 0)
clusterNum = len(set(labels))
# A sample of clusters
pdf[["Stn_Name","Tx","Tm","Clus_Db"]].head(5)
| Stn_Name | Tx | Tm | Clus_Db | |
|---|---|---|---|---|
| 0 | CHEMAINUS | 13.5 | 8.2 | 0 |
| 1 | COWICHAN LAKE FORESTRY | 15.0 | 7.0 | 0 |
| 2 | LAKE COWICHAN | 16.0 | 6.8 | 0 |
| 3 | DUNCAN KELVIN CREEK | 14.5 | 7.7 | 0 |
| 4 | ESQUIMALT HARBOUR | 13.1 | 8.8 | 0 |
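As a quick tabular check before mapping, the per-cluster mean temperature can be computed directly with a pandas groupby (the plotting loop below prints the same averages):
# Mean Tm per cluster label (-1 = noise)
pdf.groupby('Clus_Db')['Tm'].mean()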
8- Visualization of clusters based on location and Temperature¶
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = (14,10)
my_map = Basemap(projection='merc',
                 resolution='l', area_thresh=1000.0,
                 llcrnrlon=llon, llcrnrlat=llat,  # min longitude (llcrnrlon) and latitude (llcrnrlat)
                 urcrnrlon=ulon, urcrnrlat=ulat)  # max longitude (urcrnrlon) and latitude (urcrnrlat)

my_map.drawcoastlines()
my_map.drawcountries()
# my_map.drawmapboundary()
my_map.fillcontinents(color='white', alpha=0.3)
my_map.shadedrelief()

# To create a color map
colors = plt.get_cmap('jet')(np.linspace(0.0, 1.0, clusterNum))

# Visualization 1: plot clusters, with noise in gray
for clust_number in set(labels):
    c = ([0.4, 0.4, 0.4] if clust_number == -1 else colors[int(clust_number)])
    clust_set = pdf[pdf.Clus_Db == clust_number]
    my_map.scatter(clust_set.xm, clust_set.ym, color=c, marker='o', s=20, alpha=0.85)
    if clust_number != -1:
        cenx = np.mean(clust_set.xm)
        ceny = np.mean(clust_set.ym)
        plt.text(cenx, ceny, str(clust_number), fontsize=35, color='red')
        print("Cluster " + str(clust_number) + ', Avg Temp: ' + str(np.mean(clust_set.Tm)))
Cluster 0, Avg Temp: 6.2211920529801334
Cluster 1, Avg Temp: 6.790000000000001
Cluster 2, Avg Temp: -0.49411764705882355
Cluster 3, Avg Temp: -13.877209302325586
Cluster 4, Avg Temp: -4.186274509803922
Cluster 5, Avg Temp: -16.301503759398482
Cluster 6, Avg Temp: -13.599999999999998
Cluster 7, Avg Temp: -9.753333333333334
Cluster 8, Avg Temp: -4.258333333333334