Decision Trees: Topic-Based Exploration¶

This notebook is organized by key machine learning topics, progressing from basic to advanced, and is designed for hands-on experimentation and project work. You are encouraged to explore, modify, and extend each section.

01. Data Loading & Exploration¶

Load the dataset and perform initial exploration. Understand the data structure and basic statistics.

Table of Contents¶

  1. About the dataset
  2. Downloading the data
  3. Data pre-processing
  4. Setting up the Decision Tree
  5. Modeling
  6. Prediction
  7. Evaluation
  8. Visualization

Let's import the libraries I'll use for this project:

  • numpy (as np)
  • pandas (as pd)
  • DecisionTreeClassifier and the tree module from sklearn.tree

(If you are using your own environment, ensure the required libraries are installed.)¶

In [ ]:
# import piplite
# await piplite.install(['pandas'])
# await piplite.install(['matplotlib'])
# await piplite.install(['numpy'])
# await piplite.install(['scikit-learn'])
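If you are running this notebook locally rather than in JupyterLite, the same packages can usually be installed with pip; a hedged equivalent of the cell above (uncomment to use):

In [ ]:
# Local-environment equivalent of the piplite installs above (assumption: pip is available)
# %pip install numpy pandas matplotlib scikit-learn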
In [1]:
import numpy as np 
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
import sklearn.tree as tree
In [ ]:
# from pyodide.http import pyfetch

# async def download(url, filename):
#     response = await pyfetch(url)
#     if response.status == 200:
#         with open(filename, "wb") as f:
#             f.write(await response.bytes())
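Outside of JupyterLite, the helper above is not needed; a minimal local alternative, assuming direct internet access, is the standard library's urlretrieve (using the URL assigned to path in the Downloading the Data section):

In [ ]:
# Local alternative to the pyfetch helper (illustrative; uncomment to use)
# import urllib.request
# urllib.request.urlretrieve(path, "drug200.csv")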

About the dataset¶

As someone interested in healthcare analytics, I compiled a dataset of patients who all suffered from the same illness. Each patient responded to one of five medications: Drug A, Drug B, Drug C, Drug X, or Drug Y. My aim is to build a model that can recommend the most appropriate drug for future patients based on their features: Age, Sex, Blood Pressure, Cholesterol, and Sodium-to-Potassium ratio.

Downloading the Data¶

The dataset is available online. I'll use Python to download and load it for analysis.

In [3]:
path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/drug200.csv'
# await download(path,"drug200.csv")
# path="drug200.csv"

Now, let's read the data into a pandas DataFrame:

In [4]:
my_data = pd.read_csv(path, delimiter=",")
my_data[0:5]
Out[4]:
   Age Sex      BP Cholesterol  Na_to_K   Drug
0   23   F    HIGH        HIGH   25.355  drugY
1   47   M     LOW        HIGH   13.093  drugC
2   47   M     LOW        HIGH   10.114  drugC
3   28   F  NORMAL        HIGH    7.798  drugX
4   61   F     LOW        HIGH   18.043  drugY

Practice: What is the size of the data?

Try printing the shape of the DataFrame to see how many rows and columns it contains.
In [5]:
# write your code here

my_data.shape
Out[5]:
(200, 6)
Show solution
my_data.shape
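Beyond the shape, a quick look at the column types, summary statistics, and class balance helps before preprocessing; a minimal exploration sketch (output omitted):

In [ ]:
# Column names, dtypes, and non-null counts
my_data.info()

# Summary statistics for the numeric columns (Age, Na_to_K)
print(my_data.describe())

# Class balance: how many patients responded to each drug
print(my_data['Drug'].value_counts())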

Data Pre-processing¶

To prepare the data for modeling, I'll separate the features and the target variable.

Using the DataFrame, I'll define:

  • X as the feature matrix (input variables)
  • y as the response vector (target variable)

The target column, Drug, holds the labels I want to predict, so it is excluded from the feature matrix; X contains only the five input columns.

In [6]:
X = my_data[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values
X[0:5]
Out[6]:
array([[23, 'F', 'HIGH', 'HIGH', 25.355],
       [47, 'M', 'LOW', 'HIGH', 13.093],
       [47, 'M', 'LOW', 'HIGH', 10.114],
       [28, 'F', 'NORMAL', 'HIGH', 7.798],
       [61, 'F', 'LOW', 'HIGH', 18.043]], dtype=object)

Some features, such as Sex, BP, and Cholesterol, are categorical. scikit-learn's decision trees expect numeric inputs, so I'll convert these columns to integer codes with label encoding.

In [7]:
from sklearn import preprocessing

# Sex: F -> 0, M -> 1 (LabelEncoder assigns codes in sorted order of the classes)
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F', 'M'])
X[:, 1] = le_sex.transform(X[:, 1])

# BP: HIGH -> 0, LOW -> 1, NORMAL -> 2
le_BP = preprocessing.LabelEncoder()
le_BP.fit(['LOW', 'NORMAL', 'HIGH'])
X[:, 2] = le_BP.transform(X[:, 2])

# Cholesterol: HIGH -> 0, NORMAL -> 1
le_Chol = preprocessing.LabelEncoder()
le_Chol.fit(['NORMAL', 'HIGH'])
X[:, 3] = le_Chol.transform(X[:, 3])

X[0:5]
Out[7]:
array([[23, 0, 0, 0, 25.355],
       [47, 1, 1, 0, 13.093],
       [47, 1, 1, 0, 10.114],
       [28, 0, 2, 0, 7.798],
       [61, 0, 1, 0, 18.043]], dtype=object)
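As an aside, the same integer codes can be produced with pandas alone, since both LabelEncoder and pandas' category dtype assign codes in sorted order. A minimal alternative sketch, not used in the rest of the notebook:

In [ ]:
# Alternative encoding with pandas (illustrative only; codes match the LabelEncoder output above)
X_alt = my_data[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].copy()
for col in ['Sex', 'BP', 'Cholesterol']:
    X_alt[col] = X_alt[col].astype('category').cat.codes
X_alt.head()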

Now I'll assign the target variable, which is the drug each patient responded to.

In [8]:
y = my_data["Drug"]
y[0:5]
Out[8]:
0    drugY
1    drugC
2    drugC
3    drugX
4    drugY
Name: Drug, dtype: object

Setting up the Decision Tree¶

I'll use a train/test split to evaluate the model's performance. Let's import the necessary function and split the data.
In [9]:
from sklearn.model_selection import train_test_split

train_test_split returns four arrays, which I'll name X_trainset, X_testset, y_trainset, and y_testset.

It is called with the feature matrix X, the target vector y, test_size=0.3 (so 30% of the data is held out for testing, a 70/30 split), and random_state=3 so the same split is reproduced on every run.

In [10]:
X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)

Practice: Print the shape of the training sets

Check that the dimensions of the training features and labels match.
In [13]:
# your code
print('Shape of X training set {}'.format(X_trainset.shape),'&',' Size of Y training set {}'.format(y_trainset.shape))
Shape of X training set (140, 5) &  Size of Y training set (140,)
Show solution
print('Shape of X training set {}'.format(X_trainset.shape),'&',' Size of Y training set {}'.format(y_trainset.shape))

Now, print the shape of the test features and labels to confirm the split.

In [14]:
# your code
print('Shape of X test set {}'.format(X_testset.shape),'&',' Size of Y test set {}'.format(y_testset.shape))
Shape of X test set (60, 5) &  Size of Y test set (60,)
Show solution
print('Shape of X test set {}'.format(X_testset.shape),'&',' Size of Y test set {}'.format(y_testset.shape))
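Because some drugs appear far less often than others, a stratified split can keep the class proportions similar in the training and test sets. A hedged variation on the split above, not used in the rest of the notebook:

In [ ]:
# Stratified variant (illustrative): preserves per-drug proportions in both splits
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.3, random_state=3, stratify=y)
print(y_tr_s.value_counts(normalize=True))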

Modeling¶

I'll create an instance of the DecisionTreeClassifier and specify the criterion as 'entropy' to measure information gain. I'll also set a maximum depth to avoid overfitting.
In [15]:
drugTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
drugTree # it shows the default parameters
Out[15]:
DecisionTreeClassifier(criterion='entropy', max_depth=4)
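For intuition about the 'entropy' criterion: at each split the tree picks the feature and threshold that most reduce the entropy of the label distribution. A minimal sketch of the entropy of the training labels (illustrative only):

In [ ]:
# Entropy H = -sum(p * log2(p)) over the class proportions of the training labels
p = y_trainset.value_counts(normalize=True).values
print("Entropy of training labels:", -np.sum(p * np.log2(p)))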

Next, I'll fit the model using the training data.

In [16]:
drugTree.fit(X_trainset,y_trainset)
Out[16]:
DecisionTreeClassifier(criterion='entropy', max_depth=4)

Prediction¶

Now I'll use the trained model to make predictions on the test set and compare them to the actual values.
In [17]:
predTree = drugTree.predict(X_testset)

You can print both the predicted and actual values to visually inspect the model's performance.

In [18]:
print(predTree[0:5])
print(y_testset[0:5])
['drugY' 'drugX' 'drugX' 'drugX' 'drugX']
40     drugY
51     drugX
139    drugX
197    drugX
170    drugX
Name: Drug, dtype: object
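A side-by-side table can make this comparison easier to scan; a small sketch (illustrative only):

In [ ]:
# Align predicted and actual labels in one DataFrame for quick inspection
comparison = pd.DataFrame({'actual': y_testset.values, 'predicted': predTree})
comparison.head(10)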

Evaluation¶

To evaluate the model, I'll use accuracy as the metric. This will show how well the Decision Tree predicts the correct drug for new patients.
In [19]:
from sklearn import metrics
import matplotlib.pyplot as plt
print("DecisionTrees's Accuracy: ", metrics.accuracy_score(y_testset, predTree))
DecisionTrees's Accuracy:  0.9833333333333333

Accuracy here is the fraction of test samples for which the predicted drug exactly matches the true drug. (In the multilabel setting, accuracy_score computes subset accuracy: the entire set of predicted labels for a sample must strictly match the true set to count as 1.0; otherwise that sample scores 0.0.)
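The same figure can be recovered by hand, and a per-class breakdown adds detail; a minimal sketch:

In [ ]:
# Accuracy by hand: fraction of test samples where the prediction equals the true label
print("Manual accuracy:", np.mean(predTree == y_testset.values))

# Per-class precision, recall, and F1 from scikit-learn
print(metrics.classification_report(y_testset, predTree))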


Visualization¶

Let's visualize the structure of the trained Decision Tree to better understand how decisions are made.

In [ ]:
# Note: tree.plot_tree below only requires matplotlib. The installs here are only
# needed if you want graphviz-based rendering (e.g. via export_graphviz) instead.
#!conda install -c conda-forge pydotplus -y
#!conda install -c conda-forge python-graphviz -y
In [20]:
tree.plot_tree(drugTree)
plt.show()
(Output: a plot of the fitted decision tree, one box per split and leaf node.)
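The default rendering labels nodes by feature index. Passing the column names and the fitted class labels makes the plot easier to read; a hedged refinement of the cell above:

In [ ]:
# Re-plot with readable feature and class names (illustrative refinement)
plt.figure(figsize=(12, 8))
tree.plot_tree(drugTree,
               feature_names=['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K'],
               class_names=list(drugTree.classes_),
               filled=True)
plt.show()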

Personal Experimentation Space¶

Use this section to try new ideas, tune hyperparameters, or test additional algorithms. Document your experiments and insights here.
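For example, a small grid search over tree depth and split criterion is one way to start; a minimal sketch, assuming the training split defined earlier:

In [ ]:
# Illustrative hyperparameter search (5-fold cross-validation on the training set)
from sklearn.model_selection import GridSearchCV

param_grid = {'criterion': ['entropy', 'gini'], 'max_depth': [2, 3, 4, 5, 6]}
search = GridSearchCV(DecisionTreeClassifier(random_state=3), param_grid, cv=5)
search.fit(X_trainset, y_trainset)
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)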


Project-Oriented Challenge¶

Design and implement your own end-to-end decision tree classification project. Define your problem statement, preprocess data, engineer features, select and tune models, and present your results. Use the space below to outline and execute your project.

Last updated: June 13, 2025
Python version: 3.8+
Required packages: See requirements.txt
Prerequisites: Basic Python, Jupyter Notebook