Support Vector Machines: Topic-Based Exploration¶

This notebook is organized by key machine learning topics, progressing from basic to advanced, and is designed for hands-on experimentation and project work. You are encouraged to explore, modify, and extend each section.


In this project, I use Support Vector Machines (SVM) to classify human cell records as benign or malignant. My goal is to build a model that can assist in early cancer detection. This notebook reflects my personal approach and learning process.

SVM works by mapping data to a high-dimensional feature space so that data points can be categorized even when they are not linearly separable in the original space. The data are transformed so that a separator between the categories can be drawn as a hyperplane; the characteristics of new records can then be used to predict which group they belong to.
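As a tiny, hedged illustration of this mapping idea (a toy example, not part of the cancer dataset): points on a line that cannot be split by a single threshold become linearly separable after a simple feature map into two dimensions.

import numpy as np

# 1-D points: the inner pair and outer pair belong to different classes,
# so no single threshold on x separates them.
x = np.array([-2.0, -1.0, 1.0, 2.0])
labels = np.array([0, 1, 1, 0])

# Map each point to (x, x**2); in this 2-D space a horizontal line such as
# x2 = 2.5 acts as a separating hyperplane between the two classes.
mapped = np.column_stack([x, x ** 2])
print(mapped)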

Table of Contents¶

  1. Load the cancer dataset
  2. Data pre-processing and selection
  3. Modeling
  4. Evaluation
  5. Practice

In [1]:
#!pip install scikit-learn==0.23.1
In [2]:
# import piplite
# await piplite.install(['pandas'])
# await piplite.install(['matplotlib'])
# await piplite.install(['numpy'])
# await piplite.install(['scikit-learn'])
# await piplite.install(['scipy'])
In [3]:
import pandas as pd
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
%matplotlib inline
import matplotlib.pyplot as plt
In [ ]:
# from pyodide.http import pyfetch

# async def download(url, filename):
#     response = await pyfetch(url)
#     if response.status == 200:
#         with open(filename, "wb") as f:
#             f.write(await response.bytes())

Load the Cancer Dataset¶

For this project, I use a dataset of human cell samples. Each record contains measurements of cell characteristics and a label indicating whether the sample is benign or malignant. My goal is to build a model that can accurately classify new samples.

The dataset consists of several hundred human cell sample records, each of which contains the values of a set of cell characteristics. The fields in each record are:

| Field name | Description |
|---|---|
| ID | Cell id |
| Clump | Clump thickness |
| UnifSize | Uniformity of cell size |
| UnifShape | Uniformity of cell shape |
| MargAdh | Marginal adhesion |
| SingEpiSize | Single epithelial cell size |
| BareNuc | Bare nuclei |
| BlandChrom | Bland chromatin |
| NormNucl | Normal nucleoli |
| Mit | Mitoses |
| Class | Benign or malignant |


For the purposes of this example, we're using a dataset that has a relatively small number of predictors in each record. The data is hosted on IBM Object Storage, and we can read it directly from its URL with pandas.


In [ ]:
# The dataset is available online. I'll use Python to download and load it for analysis.
path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/cell_samples.csv"

Load Data From CSV File¶

Let's read the data into a pandas DataFrame.

In [ ]:
# await download(path, "cell_samples.csv")
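The commented-out helper above relies on pyfetch, which only exists in a pyodide/JupyterLite environment. In a regular Python environment, a local copy could be saved with the standard library instead; a minimal sketch (the local filename is just illustrative, and pd.read_csv can also read the URL directly):

import urllib.request

# Save a local copy of the CSV next to the notebook.
urllib.request.urlretrieve(path, "cell_samples.csv")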
In [ ]:
cell_df = pd.read_csv(path)
cell_df.head()
Out[ ]:
ID Clump UnifSize UnifShape MargAdh SingEpiSize BareNuc BlandChrom NormNucl Mit Class
0 1000025 5 1 1 1 2 1 3 1 1 2
1 1002945 5 4 4 5 7 10 3 2 1 2
2 1015425 3 1 1 1 2 2 3 1 1 2
3 1016277 6 8 8 1 3 4 3 7 1 2
4 1017023 4 1 1 3 2 1 3 1 1 2

The ID field contains the patient identifiers. The characteristics of the cell samples from each patient are contained in fields Clump to Mit. The values are graded from 1 to 10, with 1 being the closest to benign.

The 'Class' field contains the diagnosis: 2 for benign and 4 for malignant. I'll visualize the distribution of these classes using scatter plots.
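Before plotting, it can help to check the class balance. A one-line check, assuming cell_df has been loaded as above:

# Count samples per class (2 = benign, 4 = malignant).
cell_df['Class'].value_counts()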

Let's look at the distribution of the classes based on Clump thickness and Uniformity of cell size:

In [ ]:
ax = cell_df[cell_df['Class'] == 4][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='DarkBlue', label='malignant');
cell_df[cell_df['Class'] == 2][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='Yellow', label='benign', ax=ax);
plt.show()
[Figure: scatter plot of Clump vs. UnifSize for the first 50 malignant and first 50 benign samples]

Data Pre-processing and Selection¶

I'll clean the data and select relevant features for modeling.

First, I'll check the data types of each column to identify any non-numeric values that need to be handled.

In [ ]:
cell_df.dtypes
Out[ ]:
ID              int64
Clump           int64
UnifSize        int64
UnifShape       int64
MargAdh         int64
SingEpiSize     int64
BareNuc        object
BlandChrom      int64
NormNucl        int64
Mit             int64
Class           int64
dtype: object

The 'BareNuc' column contains some non-numeric values. I'll remove those rows and convert the column to integers.

In [ ]:
cell_df = cell_df[pd.to_numeric(cell_df['BareNuc'], errors='coerce').notnull()]
cell_df['BareNuc'] = cell_df['BareNuc'].astype('int')
cell_df.dtypes
Out[ ]:
ID             int64
Clump          int64
UnifSize       int64
UnifShape      int64
MargAdh        int64
SingEpiSize    int64
BareNuc        int32
BlandChrom     int64
NormNucl       int64
Mit            int64
Class          int64
dtype: object
In [ ]:
feature_df = cell_df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]
X = np.asarray(feature_df)
X[0:5]
Out[ ]:
array([[ 5,  1,  1,  1,  2,  1,  3,  1,  1],
       [ 5,  4,  4,  5,  7, 10,  3,  2,  1],
       [ 3,  1,  1,  1,  2,  2,  3,  1,  1],
       [ 6,  8,  8,  1,  3,  4,  3,  7,  1],
       [ 4,  1,  1,  3,  2,  1,  3,  1,  1]], dtype=int64)
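All nine features already share the same 1-to-10 grading, so scaling is not strictly needed here, but SVMs are generally sensitive to feature scale. If the features had different units, one would typically standardize them first; a minimal sketch using the preprocessing module imported earlier (not applied in the rest of this notebook):

# Standardize each feature to zero mean and unit variance (optional for this dataset).
X_scaled = preprocessing.StandardScaler().fit_transform(X)
X_scaled[0:2]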

I want the model to predict whether a sample is benign or malignant. I'll convert the 'Class' column to integers and use it as the target variable.

In [ ]:
cell_df['Class'] = cell_df['Class'].astype('int')
y = np.asarray(cell_df['Class'])
y[0:5]
Out[ ]:
array([2, 2, 2, 2, 2])

Train/Test Split¶

I'll split the data into training and test sets to evaluate the model's performance.

Splitting the data ensures the model is evaluated on samples it has not seen during training, giving a more realistic estimate of its accuracy.

In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)
Train set: (546, 9) (546,)
Test set: (137, 9) (137,)

Modeling (SVM with scikit-learn)¶

I'll build an SVM classifier using the default RBF kernel and fit it to the training data.

The SVM algorithm can use different kernel functions to map data into higher-dimensional spaces. I'll start with the RBF (Radial Basis Function) kernel, which is commonly used for non-linear data.

Mapping data into a higher-dimensional space is called kernelling. The mathematical function used for the transformation is known as the kernel function, and it can be of different types, such as:

1. Linear
2. Polynomial
3. Radial basis function (RBF)
4. Sigmoid

Each of these functions has its own characteristics, pros and cons, and equation, but there is no easy way of knowing in advance which one performs best for a given dataset, so we usually try different functions in turn and compare the results (a quick comparison is sketched below). For this lab, let's just use the default, RBF (Radial Basis Function).
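A hedged sketch of that comparison, assuming X_train, y_train, X_test, and y_test from the split above (extra exploration, not a required step):

from sklearn import svm
from sklearn.metrics import accuracy_score

# Fit one SVC per kernel and report test-set accuracy for a rough comparison.
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    model = svm.SVC(kernel=kernel)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{kernel:>8}: accuracy = {acc:.3f}")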

In [ ]:
from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train) 
Out[ ]:
SVC()

After training the model, I'll use it to predict the class of new samples in the test set.

In [ ]:
yhat = clf.predict(X_test)
yhat[0:5]
Out[ ]:
array([2, 4, 2, 4, 2])

Evaluation¶

I'll use metrics like accuracy, confusion matrix, F1 score, and Jaccard index to evaluate the model's performance.
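Overall accuracy is the simplest of these metrics; a one-line check, assuming yhat from the cell above:

from sklearn.metrics import accuracy_score

# Fraction of test samples whose predicted class matches the true class.
accuracy_score(y_test, yhat)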

In [ ]:
from sklearn.metrics import classification_report, confusion_matrix
import itertools
In [ ]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
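If a recent scikit-learn (1.0 or later) is available, a similar plot can be produced without a custom helper via ConfusionMatrixDisplay; a brief alternative sketch (the helper above is what this notebook actually uses):

from sklearn.metrics import ConfusionMatrixDisplay

# Build and plot the confusion matrix directly from true and predicted labels.
ConfusionMatrixDisplay.from_predictions(y_test, yhat, display_labels=['Benign(2)', 'Malignant(4)'])
plt.show()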
In [ ]:
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[2, 4])
np.set_printoptions(precision=2)

print(classification_report(y_test, yhat))

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Benign(2)', 'Malignant(4)'], normalize=False, title='Confusion matrix')
              precision    recall  f1-score   support

           2       1.00      0.94      0.97        90
           4       0.90      1.00      0.95        47

    accuracy                           0.96       137
   macro avg       0.95      0.97      0.96       137
weighted avg       0.97      0.96      0.96       137

Confusion matrix, without normalization
[[85  5]
 [ 0 47]]
[Figure: confusion matrix for the RBF-kernel SVM on the test set]

The F1 score balances precision and recall, providing a single metric for model performance.

In [ ]:
from sklearn.metrics import f1_score
f1_score(y_test, yhat, average='weighted') 
Out[ ]:
0.9639038982104676

The Jaccard index measures the similarity between predicted and actual labels. A higher value indicates better performance.

In [ ]:
from sklearn.metrics import jaccard_score
jaccard_score(y_test, yhat, pos_label=2)
Out[ ]:
0.9444444444444444
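With pos_label=2 the score treats the benign class as the positive one. For a complementary view, the same metric can be computed with the malignant class as positive (a small optional check):

# Jaccard similarity with the malignant class (4) treated as positive.
jaccard_score(y_test, yhat, pos_label=4)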

Practice: Try a Linear Kernel¶

Can you rebuild the model using a linear kernel? You can use the kernel='linear' option when you define the SVM. How does the accuracy compare to the RBF kernel?
In [ ]:
# write your code here
clf1 = svm.SVC(kernel='linear')
clf1.fit(X_train, y_train) 
yhat1 = clf1.predict(X_test)
print("Avg F1-score: %.4f" % f1_score(y_test, yhat1, average='weighted'))
print("Jaccard score: %.4f" % jaccard_score(y_test, yhat1,pos_label=2))
Avg F1-score: 0.9639
Jaccard score: 0.9444

Thank you for exploring Support Vector Machines with me!¶

Notebook authored and personalized by Mohammad Sayem Chowdhury