Support Vector Machines: Topic-Based Exploration¶
This notebook is organized by key machine learning topics, progressing from basic to advanced, and is designed for hands-on experimentation and project work. You are encouraged to explore, modify, and extend each section.
In this project, I use Support Vector Machines (SVM) to classify human cell records as benign or malignant. My goal is to build a model that can assist in early cancer detection. This notebook reflects my personal approach and learning process.
SVM works by mapping the data to a high-dimensional feature space so that the categories can be separated even when they are not linearly separable in the original space. A separator between the categories is found in that space, where it can be drawn as a hyperplane. The characteristics of a new record then determine which side of the hyperplane it falls on, and therefore which group it is predicted to belong to.
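To make this concrete, here is a small illustrative sketch on a made-up 1-D toy dataset (my own addition, not the cancer data): points that cannot be separated by a single threshold become linearly separable after an explicit feature map, which is the same trick a kernel performs implicitly.
# Illustrative sketch only: a 1-D toy problem that is not linearly separable
# becomes separable after an explicit feature map phi(x) = (x, x^2).
import numpy as np
from sklearn.svm import SVC

x_toy = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
y_toy = np.array([1, 1, 0, 0, 0, 1, 1])   # class 0 sits between the class-1 points

# Explicit mapping into a higher-dimensional space
x_mapped = np.hstack([x_toy, x_toy ** 2])

# A linear separator (hyperplane) now suffices in the mapped space
clf_toy = SVC(kernel='linear').fit(x_mapped, y_toy)
print(clf_toy.predict(x_mapped))   # recovers the original labels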
Table of Contents¶
- Load the cancer dataset
- Data pre-processing and selection
- Modeling
- Evaluation
- Practice
#!pip install scikit-learn==0.23.1
# import piplite
# await piplite.install(['pandas'])
# await piplite.install(['matplotlib'])
# await piplite.install(['numpy'])
# await piplite.install(['scikit-learn'])
# await piplite.install(['scipy'])
import pandas as pd
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
%matplotlib inline
import matplotlib.pyplot as plt
# from pyodide.http import pyfetch
# async def download(url, filename):
# response = await pyfetch(url)
# if response.status == 200:
# with open(filename, "wb") as f:
# f.write(await response.bytes())
Load the Cancer Dataset¶
For this project, I use a dataset of human cell samples. Each record contains measurements of cell characteristics and a label indicating whether the sample is benign or malignant. My goal is to build a model that can accurately classify new samples.
The dataset consists of several hundred human cell sample records, each of which contains the values of a set of cell characteristics. The fields in each record are:
| Field name | Description |
|---|---|
| ID | Cell id |
| Clump | Clump thickness |
| UnifSize | Uniformity of cell size |
| UnifShape | Uniformity of cell shape |
| MargAdh | Marginal adhesion |
| SingEpiSize | Single epithelial cell size |
| BareNuc | Bare nuclei |
| BlandChrom | Bland chromatin |
| NormNucl | Normal nucleoli |
| Mit | Mitoses |
| Class | Benign or malignant |
For the purposes of this example, we're using a dataset that has a relatively small number of predictors in each record. The data is hosted on IBM Object Storage and will be read directly from its URL.
# The dataset is available online. I'll use Python to download and load it for analysis.
path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/cell_samples.csv"
Load Data From CSV File¶
Let's read the data into a pandas DataFrame.
# await download(path, "cell_samples.csv")
cell_df = pd.read_csv(path)
cell_df.head()
| | ID | Clump | UnifSize | UnifShape | MargAdh | SingEpiSize | BareNuc | BlandChrom | NormNucl | Mit | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
The ID field contains the patient identifiers. The characteristics of the cell samples from each patient are contained in fields Clump to Mit. The values are graded from 1 to 10, with 1 being the closest to benign.
The 'Class' field contains the diagnosis: 2 for benign and 4 for malignant. I'll visualize the distribution of these classes using scatter plots.
Let's look at the distribution of the classes based on Clump thickness and Uniformity of cell size:
ax = cell_df[cell_df['Class'] == 4][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='DarkBlue', label='malignant');
cell_df[cell_df['Class'] == 2][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='Yellow', label='benign', ax=ax);
plt.show()
Data Pre-processing and Selection¶
I'll clean the data and select relevant features for modeling.
First, I'll check the data types of each column to identify any non-numeric values that need to be handled.
cell_df.dtypes
ID              int64
Clump           int64
UnifSize        int64
UnifShape       int64
MargAdh         int64
SingEpiSize     int64
BareNuc        object
BlandChrom      int64
NormNucl        int64
Mit             int64
Class           int64
dtype: object
The 'BareNuc' column contains some non-numeric values. I'll remove those rows and convert the column to integers.
cell_df = cell_df[pd.to_numeric(cell_df['BareNuc'], errors='coerce').notnull()]
cell_df['BareNuc'] = cell_df['BareNuc'].astype('int')
cell_df.dtypes
ID              int64
Clump           int64
UnifSize        int64
UnifShape       int64
MargAdh         int64
SingEpiSize     int64
BareNuc         int32
BlandChrom      int64
NormNucl        int64
Mit             int64
Class           int64
dtype: object
feature_df = cell_df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]
X = np.asarray(feature_df)
X[0:5]
array([[ 5, 1, 1, 1, 2, 1, 3, 1, 1],
[ 5, 4, 4, 5, 7, 10, 3, 2, 1],
[ 3, 1, 1, 1, 2, 2, 3, 1, 1],
[ 6, 8, 8, 1, 3, 4, 3, 7, 1],
[ 4, 1, 1, 3, 2, 1, 3, 1, 1]], dtype=int64)
I want the model to predict whether a sample is benign or malignant. I'll convert the 'Class' column to integers and use it as the target variable.
cell_df['Class'] = cell_df['Class'].astype('int')
y = np.asarray(cell_df['Class'])
y[0:5]
array([2, 2, 2, 2, 2])
Train/Test Split¶
I'll split the data into training and test sets to evaluate the model's performance.
Splitting the dataset into separate training and test sets ensures that the model is evaluated on data it has never seen, giving a realistic estimate of its accuracy.
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape, y_train.shape)
print ('Test set:', X_test.shape, y_test.shape)
Train set: (546, 9) (546,)
Test set: (137, 9) (137,)
Modeling (SVM with scikit-learn)¶
I'll build an SVM classifier using the default RBF kernel and fit it to the training data.
The SVM algorithm can use different kernel functions to map data into higher-dimensional spaces. I'll start with the RBF (Radial Basis Function) kernel, which is commonly used for non-linear data.
Mapping data into a higher-dimensional space is called kernelling. The mathematical function used for the transformation is known as the kernel function, and it can be of different types, such as:
1. Linear
2. Polynomial
3. Radial basis function (RBF)
4. Sigmoid
Each of these functions has its own characteristics, pros and cons, and equation, but there is no easy way of knowing in advance which one performs best on a given dataset, so the usual approach is to try different functions in turn and compare the results (a sketch of this follows below). For this lab, I'll start with the default, RBF (Radial Basis Function).
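As a rough illustration of trying the kernels in turn, the short sketch below (my own addition, assuming the X_train/X_test split created above) fits one SVC per kernel and compares accuracy on the held-out set; in a real workflow, cross-validation would be a better basis for this comparison than the test set.
# Minimal sketch: try each kernel in turn and compare accuracy on the held-out set.
from sklearn import svm
from sklearn.metrics import accuracy_score

for kernel_name in ('linear', 'poly', 'rbf', 'sigmoid'):
    model = svm.SVC(kernel=kernel_name).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{kernel_name:>8}: accuracy = {acc:.4f}")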
from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)
SVC()
After training the model, I'll use it to predict the class of new samples in the test set.
yhat = clf.predict(X_test)
yhat[0:5]
array([2, 4, 2, 4, 2])
Evaluation¶
I'll use metrics like accuracy, confusion matrix, F1 score, and Jaccard index to evaluate the model's performance.
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
normalize=False,
title='Confusion matrix',
cmap=plt.cm.Blues):
"""
This function prints and plots the confusion matrix.
Normalization can be applied by setting `normalize=True`.
"""
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print("Normalized confusion matrix")
else:
print('Confusion matrix, without normalization')
print(cm)
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
fmt = '.2f' if normalize else 'd'
thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[2,4])
np.set_printoptions(precision=2)
print (classification_report(y_test, yhat))
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Benign(2)', 'Malignant(4)'], normalize=False, title='Confusion matrix')
precision recall f1-score support
2 1.00 0.94 0.97 90
4 0.90 1.00 0.95 47
accuracy 0.96 137
macro avg 0.95 0.97 0.96 137
weighted avg 0.97 0.96 0.96 137
Confusion matrix, without normalization
[[85 5]
[ 0 47]]
The F1 score is the harmonic mean of precision and recall, F1 = 2 · precision · recall / (precision + recall), so it provides a single metric that balances the two.
from sklearn.metrics import f1_score
f1_score(y_test, yhat, average='weighted')
0.9639038982104676
The Jaccard index measures the similarity between the predicted and actual labels: the size of the intersection of the two label sets divided by the size of their union. A higher value indicates better performance.
from sklearn.metrics import jaccard_score
jaccard_score(y_test, yhat, pos_label=2)
0.9444444444444444
Practice: Try a Linear Kernel¶
Can you rebuild the model using a __linear__ kernel? You can use the __kernel='linear'__ option when you define the SVM. How does the accuracy compare to the RBF kernel?
# write your code here
clf1 = svm.SVC(kernel='linear')
clf1.fit(X_train, y_train)
yhat1 = clf1.predict(X_test)
print("Avg F1-score: %.4f" % f1_score(y_test, yhat1, average='weighted'))
print("Jaccard score: %.4f" % jaccard_score(y_test, yhat1,pos_label=2))
Avg F1-score: 0.9639
Jaccard score: 0.9444
Solution
clf2 = svm.SVC(kernel='linear')
clf2.fit(X_train, y_train)
yhat2 = clf2.predict(X_test)
print("Avg F1-score: %.4f" % f1_score(y_test, yhat2, average='weighted'))
print("Jaccard score: %.4f" % jaccard_score(y_test, yhat2,pos_label=2))
Thank you for exploring Support Vector Machines with me!¶
Notebook authored and personalized by Mohammad Sayem Chowdhury