Loan Classification: Topic-Based Exploration¶

Welcome! This notebook is organized by key machine learning topics, progressing from basic to advanced, and is designed for hands-on experimentation and project work. You are encouraged to explore, modify, and extend each section.

01. Data Loading & Exploration¶

Load the dataset and perform initial exploration. Understand the data structure and basic statistics.

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
%matplotlib inline

# Load data
df = pd.read_csv('loan_train.csv')
df = df.drop(labels=["Unnamed: 0","Unnamed: 0.1"], axis=1)
df.head()
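
To "understand the data structure and basic statistics" as described above, a minimal sketch (assuming the columns shown by df.head(), including the loan_status target):

In [ ]:
# Shape, dtypes, and summary statistics
print(df.shape)
df.info()
print(df.describe())
# Class balance of the target
print(df['loan_status'].value_counts())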

02. Data Cleaning & Preprocessing¶

Prepare the data for modeling: handle missing values, convert data types, and engineer features.

In [ ]:
# Convert date columns and derive day-of-week / weekend features
df['due_date'] = pd.to_datetime(df['due_date'])
df['effective_date'] = pd.to_datetime(df['effective_date'])
df['dayofweek'] = df['effective_date'].dt.dayofweek
df['weekend'] = df['dayofweek'].apply(lambda x: 1 if (x > 3) else 0)
# Encode Gender numerically (male=0, female=1); assignment avoids the deprecated chained inplace replace
df['Gender'] = df['Gender'].replace(['male', 'female'], [0, 1])
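
The cell above converts types and engineers features; this section also mentions handling missing values, so a quick check is worth adding (a sketch; the dataset is assumed to be largely complete):

In [ ]:
# Count missing values per column; drop (or impute) if any appear
print(df.isnull().sum())
df = df.dropna()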

03. Exploratory Data Analysis (EDA)¶

Visualize distributions and relationships in the data.

In [ ]:
bins = np.linspace(df.Principal.min(), df.Principal.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'Principal', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
In [ ]:
bins = np.linspace(df.age.min(), df.age.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'age', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
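
The dayofweek feature from section 02 suggests one more view in the same FacetGrid style (a sketch): how loan status varies with the day of the week the loan became effective.

In [ ]:
bins = np.linspace(df.dayofweek.min(), df.dayofweek.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'dayofweek', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()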

04. Feature Engineering¶

Transform categorical variables and select features for modeling.

In [ ]:
# Select base features, one-hot encode education, and drop the 'Master or Above' column
Feature = df[['Principal','terms','age','Gender','weekend']]
Feature = pd.concat([Feature, pd.get_dummies(df['education'])], axis=1)
Feature.drop(['Master or Above'], axis=1, inplace=True)
X = Feature
# Encode the target: PAIDOFF=1, COLLECTION=0
y = df['loan_status'].replace(to_replace=['PAIDOFF','COLLECTION'], value=[1,0]).values
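
A quick inspection of the resulting feature matrix and target encoding (a sketch) helps confirm the one-hot columns look as expected:

In [ ]:
# Inspect the engineered features and the target encoding
print(Feature.head())
print(pd.Series(y).value_counts())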

05. Data Normalization & Splitting¶

Standardize features and split the data into training and test sets.

In [ ]:
X = preprocessing.StandardScaler().fit(X).transform(X)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
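
Note that the cell above fits the scaler on all rows before splitting, which leaks test-set statistics into the scaler. A common alternative is to split first and fit the scaler on the training portion only; a sketch (with hypothetical variable names Xtr/Xte) follows.

In [ ]:
# Alternative: split the raw features first, then fit the scaler on the training split only
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
Xtr_raw, Xte_raw, ytr, yte = train_test_split(Feature, y, test_size=0.2, random_state=4)
scaler = StandardScaler().fit(Xtr_raw)
Xtr, Xte = scaler.transform(Xtr_raw), scaler.transform(Xte_raw)
# To adopt this variant downstream: X_train, X_test, y_train, y_test = Xtr, Xte, ytr, yte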

06. Model Training: K-Nearest Neighbors (KNN)¶

Train and evaluate a KNN classifier. Experiment with different values of k.

In [ ]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

Ks = 20
mean_acc = np.zeros((Ks-1))
for n in range(1, Ks):
    neigh = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    yhat = neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
print("Best accuracy:", mean_acc.max(), "with k=", mean_acc.argmax()+1)
# Train final model
k = mean_acc.argmax()+1
loan_knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
yhat_knn = loan_knn.predict(X_test)
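
To see how sensitive KNN is to the choice of k, you can plot the accuracies collected in mean_acc above (a sketch):

In [ ]:
# Test accuracy as a function of k
plt.plot(range(1, Ks), mean_acc, marker='o')
plt.xlabel('Number of neighbors (k)')
plt.ylabel('Test accuracy')
plt.show()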

07. Model Training: Decision Tree¶

Train and evaluate a Decision Tree classifier.

In [ ]:
from sklearn.tree import DecisionTreeClassifier
loanTree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
loanTree.fit(X_train, y_train)
yhat_dt = loanTree.predict(X_test)
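
To inspect what the tree learned, plot_tree can draw it using the column names from the Feature DataFrame built in section 04 (a sketch):

In [ ]:
from sklearn.tree import plot_tree
plt.figure(figsize=(16, 8))
plot_tree(loanTree, feature_names=list(Feature.columns),
          class_names=['COLLECTION', 'PAIDOFF'], filled=True)
plt.show()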

08. Model Training: Support Vector Machine (SVM)¶

Train and evaluate an SVM classifier.

In [ ]:
from sklearn import svm
loan_svm = svm.SVC()
loan_svm.fit(X_train, y_train)
yhat_svm = loan_svm.predict(X_test)
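
svm.SVC defaults to the RBF kernel; comparing a few kernels on the same split is a quick experiment (a sketch):

In [ ]:
from sklearn.metrics import accuracy_score
# Compare a few common kernels on the same train/test split
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    clf = svm.SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, accuracy_score(y_test, clf.predict(X_test)))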

09. Model Training: Logistic Regression¶

Train and evaluate a Logistic Regression classifier.

In [ ]:
from sklearn.linear_model import LogisticRegression
loan_lr = LogisticRegression(C=0.01)
loan_lr.fit(X_train, y_train)
yhat_lr = loan_lr.predict(X_test)
yhat_prob_lr = loan_lr.predict_proba(X_test)
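
C is the inverse regularization strength (smaller C means stronger regularization); a simple sweep shows its effect (a sketch):

In [ ]:
from sklearn.metrics import accuracy_score, log_loss
# Sweep the inverse regularization strength C
for C in [0.001, 0.01, 0.1, 1, 10]:
    lr = LogisticRegression(C=C, solver='liblinear').fit(X_train, y_train)
    print("C=%g  accuracy=%.3f  log loss=%.3f" % (
        C, accuracy_score(y_test, lr.predict(X_test)),
        log_loss(y_test, lr.predict_proba(X_test))))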

10. Model Evaluation & Comparison¶

Evaluate all models using the Jaccard index, F1-score, and (for logistic regression) log loss. Compare results and discuss findings.

In [ ]:
from sklearn.metrics import jaccard_score, f1_score, log_loss
print("KNN                  Jaccard: %.3f  F1: %.3f" % (jaccard_score(y_test, yhat_knn), f1_score(y_test, yhat_knn)))
print("Decision Tree        Jaccard: %.3f  F1: %.3f" % (jaccard_score(y_test, yhat_dt), f1_score(y_test, yhat_dt)))
print("SVM                  Jaccard: %.3f  F1: %.3f" % (jaccard_score(y_test, yhat_svm), f1_score(y_test, yhat_svm)))
print("Logistic Regression  Jaccard: %.3f  F1: %.3f  LogLoss: %.3f" % (jaccard_score(y_test, yhat_lr), f1_score(y_test, yhat_lr), log_loss(y_test, yhat_prob_lr)))

Personal Experimentation Space¶

Use this section to try new ideas, tune hyperparameters, or test additional algorithms. Document your experiments and insights here.

In [ ]:
# Example: Try a Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
yhat_rf = rf.predict(X_test)
print("Random Forest Jaccard:", jaccard_score(y_test, yhat_rf))

Project-Oriented Challenge¶

Design and implement your own end-to-end loan classification project. Define your problem statement, preprocess data, engineer features, select and tune models, and present your results. Use the space below to outline and execute your project.
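
As a starting skeleton (a sketch with placeholder choices to replace with your own), a Pipeline combined with GridSearchCV keeps scaling and model selection inside each cross-validation fold:

In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder pipeline: scaling followed by an SVM; swap in your own steps and grid
pipe = Pipeline([('scale', StandardScaler()), ('clf', SVC())])
param_grid = {'clf__C': [0.1, 1, 10], 'clf__kernel': ['linear', 'rbf']}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(Feature, y)   # raw features; scaling happens inside each fold
print(search.best_params_, search.best_score_)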