Loan Classification: Topic-Based Exploration¶
Welcome! This notebook is organized by key machine learning topics, progressing from basic to advanced, and is designed for hands-on experimentation and project work. You are encouraged to explore, modify, and extend each section.
01. Data Loading & Exploration¶
Load the dataset and perform initial exploration. Understand the data structure and basic statistics.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
%matplotlib inline
# Load data
df = pd.read_csv('loan_train.csv')
df = df.drop(labels=["Unnamed: 0","Unnamed: 0.1"], axis=1)
df.head()
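Beyond `head()`, a few one-liners cover the "basic statistics" this section calls for. The sketch below uses a hypothetical miniature frame standing in for `loan_train.csv` (the column names `loan_status`, `Principal`, and `age` match those used later in the notebook; the values are made up):

```python
import pandas as pd

# Hypothetical miniature stand-in for loan_train.csv
df = pd.DataFrame({
    "loan_status": ["PAIDOFF", "COLLECTION", "PAIDOFF", "PAIDOFF"],
    "Principal": [1000, 800, 1000, 300],
    "age": [45, 33, 27, 29],
})

print(df.shape)                          # (rows, columns)
print(df.dtypes)                         # column types
print(df["loan_status"].value_counts())  # class balance of the target
print(df.describe())                     # summary stats for numeric columns
```

Checking `value_counts()` on the target early tells you whether the classes are imbalanced, which matters when interpreting accuracy later.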
02. Data Cleaning & Preprocessing¶
Prepare the data for modeling: handle missing values, convert data types, and engineer features.
df['due_date'] = pd.to_datetime(df['due_date'])
df['effective_date'] = pd.to_datetime(df['effective_date'])
df['dayofweek'] = df['effective_date'].dt.dayofweek
df['weekend'] = df['dayofweek'].apply(lambda x: 1 if (x>3) else 0)
df['Gender'] = df['Gender'].map({'male': 0, 'female': 1})  # avoids the deprecated inplace replace
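The cleaning steps above can be sketched end to end on synthetic data, which also makes it easy to sanity-check the `weekend` logic (`dayofweek` is Monday=0 through Sunday=6, so `> 3` flags Friday through Sunday). The dates and values below are made up:

```python
import pandas as pd

# Synthetic stand-in with the same columns the cleaning step touches
df = pd.DataFrame({
    "effective_date": ["9/8/2016", "9/9/2016", "9/10/2016"],
    "Gender": ["male", "female", "male"],
})
df["effective_date"] = pd.to_datetime(df["effective_date"])
df["dayofweek"] = df["effective_date"].dt.dayofweek        # Monday=0 ... Sunday=6
df["weekend"] = (df["dayofweek"] > 3).astype(int)          # Fri/Sat/Sun -> 1
df["Gender"] = df["Gender"].map({"male": 0, "female": 1})  # numeric encoding

print(df.isnull().sum())  # confirm no missing values remain
print(df.dtypes)
```

Printing `isnull().sum()` after cleaning is a cheap guard: a typo in a category (e.g. `"Male"` vs `"male"`) would surface here as an unexpected NaN from `map`.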
03. Exploratory Data Analysis (EDA)¶
Visualize distributions and relationships in the data.
bins = np.linspace(df.Principal.min(), df.Principal.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'Principal', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
bins = np.linspace(df.age.min(), df.age.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'age', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
04. Feature Engineering¶
Transform categorical variables and select features for modeling.
Feature = df[['Principal','terms','age','Gender','weekend']]
Feature = pd.concat([Feature, pd.get_dummies(df['education'])], axis=1)
Feature.drop(['Master or Above'], axis=1, inplace=True)
X = Feature
y = df['loan_status'].replace(to_replace=['PAIDOFF','COLLECTION'], value=[1,0]).values
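The `get_dummies` plus drop pattern above is worth seeing in isolation: dropping one dummy column removes the perfect collinearity among the indicators (they always sum to 1). The category strings below, other than `'Master or Above'` which appears in the code above, are illustrative placeholders:

```python
import pandas as pd

# Toy 'education' column to illustrate the one-hot step
edu = pd.Series(["High School or Below", "college", "Master or Above", "college"])
dummies = pd.get_dummies(edu)
print(dummies.columns.tolist())

# Drop one indicator to avoid perfectly collinear columns
dummies = dummies.drop(columns=["Master or Above"])
print(dummies.shape)
```

A row of all zeros in the remaining columns now implicitly encodes the dropped category.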
05. Data Normalization & Splitting¶
Standardize features and split the data into training and test sets.
X = preprocessing.StandardScaler().fit(X).transform(X)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
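Note that the cell above fits the scaler on all of `X` before splitting, which leaks test-set statistics into the transform. A leakage-free variant fits the scaler on the training split only and reuses it on the test split. Sketch on synthetic data (shapes and values are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # synthetic feature matrix
y = rng.integers(0, 2, size=100)       # synthetic binary target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

scaler = StandardScaler().fit(X_train)  # statistics come from training data only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)       # same transform applied to the test set

print(X_train.mean(axis=0).round(2))    # ~0 per feature on the training split
```

The effect is small for a dataset like this, but the habit matters: the test set should behave like genuinely unseen data.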
06. Model Training: K-Nearest Neighbors (KNN)¶
Train and evaluate a KNN classifier. Experiment with different values of k.
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
Ks = 20
mean_acc = np.zeros((Ks-1))
for n in range(1, Ks):
    neigh = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    yhat = neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
print("Best accuracy:", mean_acc.max(), "with k=", mean_acc.argmax()+1)
# Train final model
k = mean_acc.argmax()+1
loan_knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
yhat_knn = loan_knn.predict(X_test)
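One caveat with the loop above: it selects `k` by test-set accuracy, so the test set no longer gives an unbiased estimate for the chosen model. A common alternative is to pick `k` by cross-validation. A minimal sketch, using a synthetic problem from `make_classification` in place of the loan data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem standing in for the loan features
X, y = make_classification(n_samples=200, n_features=8, random_state=4)

# Choose k by 5-fold cross-validation instead of the test set
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 20)}
best_k = max(scores, key=scores.get)
print("best k:", best_k, "cv accuracy:", round(scores[best_k], 3))
```

The held-out test set is then touched only once, with the final model.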
07. Model Training: Decision Tree¶
Train and evaluate a Decision Tree classifier.
from sklearn.tree import DecisionTreeClassifier
loanTree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
loanTree.fit(X_train, y_train)
yhat_dt = loanTree.predict(X_test)
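A nice side effect of trees is interpretability: `feature_importances_` shows which features drove the splits. A sketch on synthetic data (the loan features would slot in the same way):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=4)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=4).fit(X, y)

# Importances sum to 1; larger values mean the feature drove more splits
for i, imp in enumerate(tree.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```

On the real data, replacing `feature {i}` with `Feature.columns[i]` would label the output with column names.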
08. Model Training: Support Vector Machine (SVM)¶
Train and evaluate an SVM classifier.
from sklearn import svm
loan_svm = svm.SVC()
loan_svm.fit(X_train, y_train)
yhat_svm = loan_svm.predict(X_test)
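`SVC()` defaults to the RBF kernel; comparing kernels is a cheap experiment worth running. A sketch on synthetic data (scores below are for the synthetic problem, not the loan data):

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=8, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

# Fit one SVM per kernel and compare held-out accuracy
results = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = svm.SVC(kernel=kernel).fit(X_train, y_train)
    results[kernel] = clf.score(X_test, y_test)
    print(kernel, round(results[kernel], 3))
```

Since SVMs are sensitive to feature scale, the standardization from section 05 matters particularly here.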
09. Model Training: Logistic Regression¶
Train and evaluate a Logistic Regression classifier.
from sklearn.linear_model import LogisticRegression
loan_lr = LogisticRegression(C=0.01)
loan_lr.fit(X_train, y_train)
yhat_lr = loan_lr.predict(X_test)
yhat_prob_lr = loan_lr.predict_proba(X_test)
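Logistic regression is the only model here that naturally outputs calibrated-ish probabilities, which is what `predict_proba` and log loss are for. A sketch on synthetic data (in `LogisticRegression`, smaller `C` means stronger L2 regularization):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=8, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

lr = LogisticRegression(C=0.01).fit(X_train, y_train)
proba = lr.predict_proba(X_test)   # one probability column per class
print("log loss:", round(log_loss(y_test, proba), 3))
```

Lower log loss is better; unlike accuracy, it penalizes confident wrong predictions heavily.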
10. Model Evaluation & Comparison¶
Evaluate all models using accuracy, F1-score, and other metrics. Compare results and discuss findings.
from sklearn.metrics import jaccard_score, f1_score, log_loss
print("KNN Jaccard:", jaccard_score(y_test, yhat_knn), "F1:", f1_score(y_test, yhat_knn))
print("Decision Tree Jaccard:", jaccard_score(y_test, yhat_dt), "F1:", f1_score(y_test, yhat_dt))
print("SVM Jaccard:", jaccard_score(y_test, yhat_svm), "F1:", f1_score(y_test, yhat_svm))
print("Logistic Regression Jaccard:", jaccard_score(y_test, yhat_lr), "F1:", f1_score(y_test, yhat_lr))
print("Logistic Regression LogLoss:", log_loss(y_test, yhat_prob_lr))
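For side-by-side comparison, collecting the metrics into a single DataFrame reads better than a wall of prints. A self-contained sketch on synthetic data (on the real notebook state you would reuse the fitted `loan_knn`, `loanTree`, `loan_svm`, and `loan_lr` instead of refitting):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, jaccard_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=7),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=4),
    "Logistic Regression": LogisticRegression(C=0.01),
}
rows = []
for name, model in models.items():
    yhat = model.fit(X_train, y_train).predict(X_test)
    rows.append({"model": name,
                 "jaccard": jaccard_score(y_test, yhat),
                 "f1": f1_score(y_test, yhat)})
report = pd.DataFrame(rows).set_index("model").round(3)
print(report)
```

A table like this also makes it easy to spot when two models tie on one metric but diverge on another.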
Personal Experimentation Space¶
Use this section to try new ideas, tune hyperparameters, or test additional algorithms. Document your experiments and insights here.
# Example: Try a Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
yhat_rf = rf.predict(X_test)
print("Random Forest Jaccard:", jaccard_score(y_test, yhat_rf))
Project-Oriented Challenge¶
Design and implement your own end-to-end loan classification project. Define your problem statement, preprocess data, engineer features, select and tune models, and present your results. Use the space below to outline and execute your project.