My Unique Approach to Customer Segmentation with K-Nearest Neighbors¶

Why I Chose This Project¶

In this notebook, I explore how K-Nearest Neighbors (KNN) can be used to segment customers based on real-world data. My aim is to understand how different values of K affect classification, and to share my own workflow, insights, and lessons learned along the way.


K-Nearest Neighbors for Customer Classification¶

For this project, I wanted to see how KNN could help businesses better understand their customers. By analyzing demographic and usage data, I set out to build a model that predicts customer segments. Throughout this notebook, I share my personal approach, the challenges I faced, and the strategies I used to overcome them.

K-Nearest Neighbors (KNN) is a simple yet powerful supervised learning algorithm. My understanding is that it works by looking at the K closest data points to a new observation and assigning the most common class among those neighbors. This approach feels intuitive and is easy to visualize.

My Intuitive Take on KNN¶

Imagine plotting all your data points on a graph. When a new customer comes in, you look at their K nearest neighbors and see which group they belong to most often. That’s the group you predict for the new customer. This hands-on analogy helped me grasp the core idea behind KNN.
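To make the analogy concrete, here is a minimal from-scratch sketch of that neighbor vote (a toy illustration with made-up points and a hypothetical knn_predict helper, not the model built later in this notebook):

In [ ]:
import numpy as np
from collections import Counter

def knn_predict(X_known, y_known, x_new, k):
    # Euclidean distance from the new point to every known point
    distances = np.sqrt(((X_known - x_new) ** 2).sum(axis=1))
    # Take the k closest points and let them vote on the class
    nearest = np.argsort(distances)[:k]
    return Counter(y_known[nearest]).most_common(1)[0][0]

# Toy data: two clusters of customers with two features each
X_toy = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([2, 2]), k=3))  # the 3 nearest neighbors are all class 0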

While experimenting, I noticed that the value of K can really change the results. A small K can make the model sensitive to outliers, while a large K can smooth out important differences. I decided to try several K values and see how the model’s accuracy changed, documenting my findings along the way.
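The following toy sketch (made-up points, assuming scikit-learn's KNeighborsClassifier) shows how that sensitivity can flip a prediction: with k=1 the query follows a single nearby outlier, while k=5 smooths it away.

In [ ]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Four class-0 points, one class-1 outlier near the query, and a distant class-1 cluster
X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2.1, 2.1], [5, 5], [6, 5], [5, 6]])
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
query = [[2, 2]]

for k in (1, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_demo, y_demo)
    print(f"k={k} -> predicted class {model.predict(query)[0]}")
# k=1 follows the lone outlier (class 1); k=5 votes with the larger group (class 0)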

In summary, my approach with KNN is to use the wisdom of the closest neighbors to make predictions, always keeping in mind the importance of choosing the right K for the problem at hand.

Table of Contents¶

  1. About the dataset
  2. Data visualization and analysis
  3. Classification

In [ ]:
#!pip install scikit-learn==0.23.1
In [ ]:
# import piplite
# await piplite.install(['pandas'])
# await piplite.install(['matplotlib'])
# await piplite.install(['numpy'])
# await piplite.install(['scikit-learn'])
# await piplite.install(['scipy'])

Let's load the required libraries.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import preprocessing
%matplotlib inline

About the Dataset¶

For this project, I selected a customer dataset from a telecom provider. The company grouped its customers based on how they use services. My goal was to predict which group a new customer would fall into, using features like region, age, and income. This real-world scenario motivated me to see how machine learning could drive business decisions.

The dataset includes a variety of features, such as region, age, marital status, and more. The target variable, custcat, represents four customer groups:

  1. Basic Service
  2. E-Service
  3. Plus Service
  4. Total Service

I set out to build a KNN classifier to predict the group for new or unknown customers, focusing on practical business applications.

Data Download and Preparation¶

I sourced the dataset online and used Python to load it for analysis. This step is crucial for reproducibility and transparency in any data science project.

In [ ]:
# from pyodide.http import pyfetch

# async def download(url, filename):
#     response = await pyfetch(url)
#     if response.status == 200:
#         with open(filename, "wb") as f:
#             f.write(await response.bytes())
In [4]:
path="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/teleCust1000t.csv"


Load Data From CSV File¶

Let's read the data into a pandas DataFrame.

In [ ]:
# await download(path, 'teleCust1000t.csv')
            
In [5]:
df = pd.read_csv(path)
df.head()
Out[5]:
region tenure age marital address income ed employ retire gender reside custcat
0 2 13 44 1 9 64.0 4 5 0.0 0 2 1
1 3 11 33 1 7 136.0 5 5 0.0 0 6 4
2 3 68 52 1 24 116.0 1 29 0.0 1 2 3
3 2 33 33 0 12 33.0 2 0 0.0 1 1 1
4 2 23 30 1 9 30.0 1 2 0.0 0 4 3

Data Visualization and Analysis¶

Before modeling, I explored the distribution of customer groups and visualized key features. This helped me spot patterns and potential issues in the data.

To get a sense of the data, I checked how many customers belonged to each group. For example, I found that Plus Service customers were the most common, which could influence business strategy.

I used histograms and other plots to better understand the data. For instance, visualizing income distribution revealed interesting patterns across customer groups, guiding my feature selection.
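The exact plots aren't reproduced here, but a short sketch of the kind of exploration described above looks roughly like this (it assumes the df and plt defined in the cells above, so run it after the data-loading cell):

In [ ]:
# How many customers fall into each segment (custcat is the target)
print(df['custcat'].value_counts())

# Income distribution; a large number of bins makes the long right tail visible
df.hist(column='income', bins=50)
plt.show()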

Feature Selection¶

I carefully chose which columns to use as features (X) for the model, focusing on those most relevant to customer segmentation. This step is key to building an effective classifier.

Let's select the relevant columns from the DataFrame to use as features (X).

In [8]:
df.columns
Out[8]:
Index(['region', 'tenure', 'age', 'marital', 'address', 'income', 'ed',
       'employ', 'retire', 'gender', 'reside', 'custcat'],
      dtype='object')

To prepare the data for scikit-learn, I converted the DataFrame to a NumPy array. This made it easier to work with the modeling tools.

In [9]:
X = df[['region', 'tenure', 'age', 'marital', 'address', 'income', 'ed', 'employ', 'retire', 'gender', 'reside']].values
X[0:5]
Out[9]:
array([[  2.,  13.,  44.,   1.,   9.,  64.,   4.,   5.,   0.,   0.,   2.],
       [  3.,  11.,  33.,   1.,   7., 136.,   5.,   5.,   0.,   0.,   6.],
       [  3.,  68.,  52.,   1.,  24., 116.,   1.,  29.,   0.,   1.,   2.],
       [  2.,  33.,  33.,   0.,  12.,  33.,   2.,   0.,   0.,   1.,   1.],
       [  2.,  23.,  30.,   1.,   9.,  30.,   1.,   2.,   0.,   0.,   4.]])

What are our labels?

The labels (y) are the customer group assignments for each row.

In [10]:
y = df['custcat'].values
y[0:5]
Out[10]:
array([1, 4, 3, 1, 3], dtype=int64)

Data Normalization¶

I standardized the features to have zero mean and unit variance. This is good practice in general and especially important for KNN, which classifies points based on the distances between them: without standardization, features measured on larger scales would dominate the distance calculation, while standardized features all contribute equally.
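As a small illustration of why this matters (made-up numbers, not rows from the dataset): with raw values, a modest income gap can make two customers who differ by 35 years in age look "closer" than two who differ by a single year.

In [ ]:
import numpy as np
from sklearn import preprocessing

# [age, income] for three hypothetical customers; income is on a much larger scale
X_small = np.array([[25.0, 64000.0],
                    [26.0, 70000.0],
                    [60.0, 64500.0]])

# Raw distances: customer 0 looks far closer to customer 2 (age 60) than to
# customer 1 (age 26), because the 6000 income gap dwarfs the age differences
print(np.linalg.norm(X_small[0] - X_small[1]))  # ~6000
print(np.linalg.norm(X_small[0] - X_small[2]))  # ~500

# After standardization both features contribute on a comparable scale,
# and the two distances come out roughly equal
X_scaled = preprocessing.StandardScaler().fit_transform(X_small)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))
print(np.linalg.norm(X_scaled[0] - X_scaled[2]))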

In [11]:
X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
X[0:5]
Out[11]:
array([[-0.02696767, -1.055125  ,  0.18450456,  1.0100505 , -0.25303431,
        -0.12650641,  1.0877526 , -0.5941226 , -0.22207644, -1.03459817,
        -0.23065004],
       [ 1.19883553, -1.14880563, -0.69181243,  1.0100505 , -0.4514148 ,
         0.54644972,  1.9062271 , -0.5941226 , -0.22207644, -1.03459817,
         2.55666158],
       [ 1.19883553,  1.52109247,  0.82182601,  1.0100505 ,  1.23481934,
         0.35951747, -1.36767088,  1.78752803, -0.22207644,  0.96655883,
        -0.23065004],
       [-0.02696767, -0.11831864, -0.69181243, -0.9900495 ,  0.04453642,
        -0.41625141, -0.54919639, -1.09029981, -0.22207644,  0.96655883,
        -0.92747794],
       [-0.02696767, -0.58672182, -0.93080797,  1.0100505 , -0.25303431,
        -0.44429125, -1.36767088, -0.89182893, -0.22207644, -1.03459817,
         1.16300577]])

Train/Test Split¶

To evaluate my model, I split the data into training and test sets. This approach helps prevent overfitting and gives a realistic estimate of how the model will perform on new data. I always check the shapes of my splits to ensure everything is set up correctly.

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape,  y_train.shape)
print ('Test set:', X_test.shape,  y_test.shape)
Train set: (800, 11) (800,)
Test set: (200, 11) (200,)

My KNN Classification Workflow¶

With the data ready, I built and evaluated the KNN classifier. I started with k=4, then experimented with other values to see how accuracy changed. This hands-on process deepened my understanding of model tuning.

K-Nearest Neighbors (KNN)¶

I'll start by training the model with k=4.

Importing the KNN Classifier¶

The KNeighborsClassifier implements the KNN algorithm for classification tasks.

In [13]:
from sklearn.neighbors import KNeighborsClassifier

Training the Model¶

With k=4 chosen, I fit the classifier to the training data.

In [14]:
k = 4
#Train Model and Predict  
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh
Out[14]:
KNeighborsClassifier(n_neighbors=4)

Making Predictions¶

Now I'll use the trained model to predict the customer group for the test set.

In [15]:
yhat = neigh.predict(X_test)
yhat[0:5]
Out[15]:
array([1, 1, 3, 2, 4], dtype=int64)

Evaluating Model Accuracy¶

I used accuracy as my main metric, since the goal was to correctly classify customer groups. In multiclass classification, the prediction must match the true label exactly. I compared train and test accuracy to check for overfitting.
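Because multiclass accuracy is simply the fraction of predictions that match the true label exactly, the scikit-learn call is equivalent to a manual mean over exact matches; a small sanity check, reusing yhat and y_test from the cells above:

In [ ]:
import numpy as np
from sklearn import metrics

# accuracy_score counts a prediction as correct only when it equals the true label
print(np.mean(yhat == y_test))               # manual computation
print(metrics.accuracy_score(y_test, yhat))  # same value from scikit-learn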

In [16]:
from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))
Train set Accuracy:  0.5475
Test set Accuracy:  0.32

Practice: Experimenting with K¶

To further my learning, I tried building the model with k=6 and compared the results. This experimentation helped me see the impact of K on model performance.

In [22]:
k = 6
#Train Model and Predict  
neigh6 = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)

yhat6 = neigh6.predict(X_test)

print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh6.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat6))
Train set Accuracy:  0.51625
Test set Accuracy:  0.31

Tuning K for Best Results¶

Choosing the right K is a balancing act. I reserved part of my data for testing, then tried different K values to find the one with the highest accuracy. Plotting accuracy against K made it easy to spot the optimal value.

In [29]:
Ks = 100
mean_acc = np.zeros((Ks-1))
std_acc = np.zeros((Ks-1))

for n in range(1,Ks):
    
    #Train Model and Predict  
    neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
    yhat=neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)

    
    # Standard error of the accuracy estimate: std of the per-sample
    # correct/incorrect indicators divided by sqrt(test-set size)
    std_acc[n-1] = np.std(yhat == y_test)/np.sqrt(yhat.shape[0])

mean_acc
Out[29]:
array([0.3  , 0.29 , 0.315, 0.32 , 0.315, 0.31 , 0.335, 0.325, 0.34 ,
       0.33 , 0.315, 0.34 , 0.33 , 0.315, 0.34 , 0.36 , 0.355, 0.35 ,
       0.345, 0.335, 0.35 , 0.36 , 0.37 , 0.365, 0.365, 0.365, 0.35 ,
       0.36 , 0.38 , 0.385, 0.395, 0.395, 0.38 , 0.37 , 0.365, 0.385,
       0.395, 0.41 , 0.395, 0.395, 0.395, 0.38 , 0.39 , 0.375, 0.365,
       0.38 , 0.375, 0.375, 0.365, 0.36 , 0.36 , 0.365, 0.37 , 0.38 ,
       0.37 , 0.37 , 0.37 , 0.36 , 0.35 , 0.36 , 0.355, 0.36 , 0.36 ,
       0.36 , 0.34 , 0.34 , 0.345, 0.35 , 0.35 , 0.355, 0.365, 0.355,
       0.355, 0.365, 0.37 , 0.37 , 0.37 , 0.35 , 0.35 , 0.35 , 0.35 ,
       0.36 , 0.355, 0.33 , 0.32 , 0.345, 0.345, 0.345, 0.335, 0.345,
       0.355, 0.345, 0.345, 0.34 , 0.34 , 0.335, 0.345, 0.325, 0.315])

Plotting Model Accuracy for Different K Values¶

In [30]:
plt.plot(range(1,Ks),mean_acc,'g')
plt.fill_between(range(1,Ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
plt.fill_between(range(1,Ks),mean_acc - 3 * std_acc,mean_acc + 3 * std_acc, alpha=0.10,color="green")
plt.legend(('Accuracy ', '+/- 1xstd','+/- 3xstd'))
plt.ylabel('Accuracy ')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()
[Figure: test-set accuracy plotted against the number of neighbors (K), with shaded +/- 1 std and +/- 3 std bands]
In [31]:
print( "The best accuracy was with", mean_acc.max(), "with k=", mean_acc.argmax()+1) 
The best accuracy was with 0.41 with k= 38
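As a follow-up not shown in the original run, one could refit the classifier at the best-scoring k found above and keep that model for scoring new customers; a minimal sketch, reusing mean_acc and the train/test split from the cells above:

In [ ]:
best_k = int(mean_acc.argmax()) + 1  # k with the highest test accuracy in the sweep
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("k =", best_k, "test accuracy:", metrics.accuracy_score(y_test, final_model.predict(X_test)))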

Reflections and Takeaways¶

This project taught me the importance of data preparation, model tuning, and clear communication. KNN is a great starting point for classification tasks, and experimenting with different parameters gave me valuable insights into how machine learning models work in practice.

Notebook authored and personalized by Mohammad Sayem Chowdhury