Sameen Salam- M.S. Candidate, Institute for Advanced Analytics at North Carolina State University
Email: ssalam@ncsu.edu
In this notebook, I will take you through an analysis of banking data from a financial institution in Portugal. The data is from the UC Irvine Machine Learning Repository and consists of several datasets containing client, bank, and macroeconomic information from May 2008 to November 2010.
Here is an overview on how this notebook is structured:
Introduction
Understanding the Problem
Initial Setup
Functions
Data Acquisition
Exploratory Data Analysis
Preliminary Data Check
Variables
General Statistics
Modeling
Data Pre-Processing
Logistic Regression
XGBoost
Random Forest
Support Vector Machine
The financial institution seeks to predict whether or not a client will subscribe to a term deposit product via a direct marketing campaign. Using the data provided, I will create models to accurately predict client behavior and explain the factors that drive a successful marketing campaign in relation to the term deposit subscription. From there, the bank can use said models to predict outcomes and adjust their marketing strategy.
The UC Irvine repository contains four different banking datasets collected from a Portuguese banking institution. Two of the datasets (labeled "additional" when imported) are newer and provide extra features related to the external macroeconomic context of each client interaction, while the other two (unlabeled) are older and do not include that extra information. I chose to move forward with the newer, more informative datasets.
The remaining datasets contained the exact same features, but one was ~41,000 rows, while the other was a ~4,100 row random subset of the larger set. I opted for the smaller dataset because it provides enough observations for modeling while also reducing overall computational time.
#Importing the necessary libraries
import requests
import io
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels
import numpy as np
import seaborn as sns
import xgboost as xgb
import matplotlib.patches as mpatches
from zipfile import ZipFile
from statsmodels.distributions.empirical_distribution import ECDF
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn import metrics
from imblearn.over_sampling import SMOTE
from sklearn.metrics import fbeta_score, make_scorer
from statsmodels.stats.outliers_influence import variance_inflation_factor
from IPython.display import display, HTML
These are additional functions I created or borrowed for the purposes of this analysis.
#Creating a function to download, read, and extract zip file contents: data_reader
#Note that the zip is downloaded into memory, written to the same directory as this notebook, and extracted there
def data_reader(url, output_file_name):
'''Download and extract zip files into local script directory.
Keyword Arguments:
url -- string representing url path or location
output_file_name -- file name under which the downloaded zip will be saved (string)
Note: The zip contents are extracted into the current working directory; any folders inside the zip are preserved as subdirectories.
'''
my_file = requests.get(url)
output = open(output_file_name, "wb")
output.write(my_file.content)
output.close()
with ZipFile(output_file_name, "r") as zip:
zip.printdir()
zip.extractall()
#Creating a convenient function to plot empirical cumulative distribution plots for continuous
#variables
def ecdf_plot(data, variable, x_lab, display_lines = True):
'''Plot empirical cumulative distribution function of a numerical variable
and plot mean (red) and median (green) lines
Keyword Arguments:
data -- pandas Dataframe
variable -- column name (string) **must be a numerical input**
x_lab -- X axis label (string)
display_lines -- True or False
'''
sns.set()
ecdf = ECDF(data[variable])
_= plt.plot(ecdf.x, ecdf.y, marker = ".", linestyle = "none", alpha = 0.2)
_= plt.xlabel(x_lab)
_= plt.ylabel("Cumulative Density")
if (display_lines == True):
_= plt.vlines(np.mean(data[variable]), ymax=1,ymin=0,colors='r')
_= plt.vlines(np.median(data[variable]), ymax = 1, ymin = 0, colors = 'g')
#Creating a convenient function to plot my bar graphs in a consistent way
def mybarplot(data, variable, x_lab, x_tick_rotation = 360, reorder = False, correct_index = 0):
'''Plot standardized bar graphs
Keyword Arguments:
data -- pandas Dataframe
variable -- column name (string) **Must be a categorical input**
x_lab -- X axis label (string)
x_tick_rotation -- rotation of the x-axis labels (numeric)
reorder -- Whether or not you want to reorder the bars on the graph (bool)
correct_index -- A list of strings representing the levels of the categorical variable
(string or list of strings)
'''
sns.set()
temp = data[variable].value_counts().to_frame()
if (reorder == True):
temp = temp.reindex(correct_index)
temp = temp.reset_index()
_= temp.plot.bar(x = "index", y = variable, color = "mediumseagreen", edgecolor = "black")
_= plt.ylabel("Count")
_= plt.xlabel(x_lab)
_= plt.xticks(rotation = x_tick_rotation)
#Decision cutoff report generator
def threshold_optimization(y_probas, y_real, \
decision_thresholds = [x / 10.0 for x in range(1, 10, 1)]):
'''Report confusion matrix and model metrics for probability decision cutoffs
Keyword Arguments:
y_probas -- list of predicted probabilities for the positive class (np array)
y_real -- true class labels of the validation or test data (dataframe)
decision_thresholds -- positive class probability at which the model will label an observation
a 1 (numeric or list of numeric)
'''
class_labels = pd.DataFrame(index=decision_thresholds, columns=range(len(y_probas)))
for thresh in decision_thresholds:
temp = [1 if x >= thresh else 0 for x in y_probas]
class_labels.loc[thresh,:] = temp
for i in range(len(class_labels)):
print("-----------------------------Decision Cutoff:",class_labels.index[i])
print(pd.DataFrame(
metrics.confusion_matrix(y_real, np.int64(class_labels.iloc[i,:])),
columns=['pred_neg', 'pred_pos'],
index=['neg', 'pos']))
print(metrics.classification_report(y_real,np.int64(class_labels.iloc[i,:])))
#Written by Stackoverflow user Max Ghenis
def calculate_vif_(X, thresh=5.0):
variables = list(range(X.shape[1]))
dropped = True
while dropped:
dropped = False
vif = [variance_inflation_factor(X.iloc[:, variables].values, ix)
for ix in range(X.iloc[:, variables].shape[1])]
maxloc = vif.index(max(vif))
if max(vif) > thresh:
print('dropping \'' + X.iloc[:, variables].columns[maxloc] +
'\' at index: ' + str(maxloc))
del variables[maxloc]
dropped = True
#print('Remaining variables:')
#print(X.columns[variables])
return X.iloc[:, variables]
#Getting the url from the repository for the additional datasets
url_additional = "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip"
#Using the data reader function to get the additional bank data and load it onto the same
#directory as this notebook
data_reader(url= url_additional, output_file_name="bank_additional.zip")
#Reading in the smaller of the two "additional" datasets: bank_additional
bank_additional = pd.read_csv("bank-additional/bank-additional.csv", delimiter = ";")
Taking a quick look at the data.
#Get the quick look at the data
bank_additional.head()
#Getting rid of any duplicates
bank_additional = bank_additional.drop_duplicates(keep="first")
#Getting dataset information
bank_additional.info()
#Number of unique values for each variable
bank_additional.nunique()
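One thing to keep in mind is that this dataset encodes missing categorical values as the literal string "unknown" rather than as nulls, so a standard null check will come back clean. The quick sketch below counts both, using the bank_additional frame loaded above.
#Counting conventional nulls and "unknown" placeholder values in each column
print(bank_additional.isnull().sum())
print((bank_additional == "unknown").sum())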
Before exploring relationships between different features of the data, let's take a look at them individually. Our goal here is to get an understanding of center and spread for numerical variables, and class distribution for categorical variables.
#Getting descriptives for each numeric variable
bank_additional.describe()
The plot below shows the empirical cumulative density function (ECDF) plot for the age variable. The mean (40.1 years) is indicated by the red vertical line and the median (38 years) by the green vertical line. From this plot, it is apparent that age is right skewed. You can also see that ~80% of customers are between 30 and 60 years of age, and that ~40% of customers are between 30 and 40 years of age.
#Calling the ecdf_plot function to plot empirical cumulative distribution function
ecdf_plot(data=bank_additional, variable="age", x_lab="Age of Client")
Another look at age, this time using a histogram with an overlaid density curve. Here we can see the distribution is right skewed and somewhat bell-shaped.
#Density plot and histogram of age
_= sns.distplot(bank_additional["age"],hist=True,color='orange')
_= plt.xlabel("Age of Client")
_= plt.ylabel("Density")
The plot below is the ECDF plot for duration. You can see that in almost all cases of contact with customers, duration remains below 1000 seconds. Additionally, ~90% of calls last less than 500 seconds. The mean duration is 256.8 seconds and the median is 181 seconds.
ecdf_plot(data=bank_additional, variable="duration", x_lab="Duration of Last Contact (seconds)")
Below is the ECDF plot for campaign. Based on the figure, campaign is right skewed. Just over 40% of customers were contacted once for the current marketing campaign, while ~90% were contacted five times or fewer.
ecdf_plot(data=bank_additional, variable="campaign",\
x_lab="Number of Contact Attempts",display_lines=False)
Based on the ECDF plot below of pdays, it is evident that the vast majority of observations (over 95%) represent clients with no previous contact, indicated by the placeholder value of 999 days since last contact.
ecdf_plot(data=bank_additional, variable="pdays",\
x_lab="Number of Days Since Last Contact",display_lines=False)
The density histogram below zooms in on the small cluster visible in the bottom-left corner of the previous plot: the distribution of pdays for customers who had actually been contacted in the past (i.e., non-999 values). The distribution appears to be bimodal.
bank_additional_realpdays = bank_additional.copy()
bank_additional_realpdays = bank_additional[bank_additional.pdays != 999]
_= sns.distplot(bank_additional_realpdays["pdays"],hist=True, kde=True, color='orange')
_= plt.xlabel("Number of Days Since Last Contact")
_= plt.ylabel("Density")
The plot below shows the ECDF of previous. The majority of clients have not been contacted for previous campaigns.
ecdf_plot(data=bank_additional, variable="previous",\
x_lab="Number of Previous Contact Attempts",display_lines=False)
The following ECDF plot of emp.var.rate shows that the rate takes on a few values, with most density lying at around -1.8 and above 1.
ecdf_plot(data=bank_additional, variable="emp.var.rate",\
x_lab="Employment Variation Rate", display_lines=False)
The following ECDF plot of cons.price.idx shows little variation. The Consumer Price Index stays between ~92 and ~95 across the entire dataset, so this variable may not prove very informative.
ecdf_plot(data=bank_additional, variable="cons.price.idx",\
x_lab="Consumer Price Index Value",display_lines=False)
The following ECDF plot of cons.conf.idx shows that most observations are clustered around -42 and -37, indicating a generally negative view of the economy.
ecdf_plot(data=bank_additional, variable="cons.conf.idx",\
x_lab="Consumer Confidence Index Value",display_lines=False)
The following ECDF plot of euribor3m shows that the rate is strongly bimodal, with values concentrated around 1-1.5% and ~5%.
ecdf_plot(data=bank_additional, variable="euribor3m",\
x_lab="Euribor 3 Month Rate", display_lines=False)
The following ECDF plot of nr.employed shows that ~40% of client contacts occurred when the quarterly employment indicator was above 5,200.
ecdf_plot(data=bank_additional, variable="nr.employed",\
x_lab="Number of Employees",display_lines=False)
The following plot is a bar graph of the frequencies of the different values of poutcome (nonexistent, failure, success). Unsurprisingly, nonexistent dominates this attribute, as most customers in the dataset are considered new customers. This is consistent with 0 being dominant in previous (a customer previously contacted zero times) and 999 being dominant in pdays (a customer who has never been contacted before). These new customers have never had the option to accept or decline a product.
#Calling the mybarplot function to graph poutcome
mybarplot(data=bank_additional, variable="poutcome",\
x_lab="Outcome of Previous Marketing Campaign", reorder=True,\
correct_index=["failure", "success", "nonexistent"])
The following bar graph shows how job is broken down across the dataset. Administrative, blue-collar, and technician occupations are the most common in the data.
mybarplot(data=bank_additional, variable="job", x_lab="Occupation", x_tick_rotation=90)
The following bar graph shows how marital is broken down across the dataset. Married clients make up over half of the observations.
mybarplot(data=bank_additional, variable="marital", x_lab="Marital Status")
The following bar graph shows the frequencies of different levels of education. The most common educational attainment is a university degree with over 1200 observations, but it does not make up a majority of clients.
mybarplot(data=bank_additional, variable="education", x_lab="Educational Attainment",\
x_tick_rotation=90)
The bar graph of default shown below illustrates that most customers do not have any credit in default. A significant minority of the observations, however, are listed as unknown, which may prove to be interesting with further analysis.
mybarplot(data=bank_additional, variable="default", x_lab="Credit Default Status", reorder=True,\
correct_index=["no", "yes", "unknown"])
The bar graph of housing shown below indicates that clients are somewhat evenly split between having or not having a housing loan.
mybarplot(data=bank_additional, variable="housing", \
x_lab="Housing Loan Possession", reorder=True,\
correct_index=["no", "yes", "unknown"])
The bar graph below of loan shows that most customers do not have a personal loan.
mybarplot(data=bank_additional, variable="loan", x_lab="Personal Loan Possession")
The bar graph below of contact shows that more clients are contacted via cell phone, but a strong minority of clients stick to telephone interactions.
mybarplot(data=bank_additional, variable="contact", x_lab="Contact Medium")
The bar graph of day_of_week, shown below, indicates an even spread of client contact over the course of the work week.
mybarplot(data=bank_additional, variable="day_of_week",\
x_lab="Day of Week of Last Contact",reorder=True,\
correct_index=["mon","tue","wed", "thu", "fri"])
The bar graph below shows the breakdown of month. Only 10 months appear because no clients were last contacted in January or February. Additionally, March, September, October, and December have relatively few observations. May has significantly more records of client contact than any other month.
mybarplot(data=bank_additional, variable="month", x_lab="Month of Last Contact", reorder=True,\
correct_index=["mar","apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"])
The barplot below shows the breakdown of y, our target variable. We see that 3,668 clients did not sign up for the term deposit subscription and 451 did (89.05%-10.95% split).
#Bar plot of the target variable y
mybarplot(data=bank_additional, variable="y", x_lab="Target Variable")
#Table with frequencies and percentages of levels of y: new_table
new_table = bank_additional["y"].value_counts().to_frame()
new_table["y_percentage"] = [round(float(new_table.loc["no"]/sum(new_table["y"])),4)*100, \
round(float(new_table.loc["yes"]/sum(new_table["y"])),4)*100]
new_table
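For a quick cross-check of the table above, value_counts with normalize=True produces the same split in a couple of lines; this is just a convenience sketch using the same bank_additional frame.
#Cross-checking the class split with normalized value counts
print(bank_additional["y"].value_counts())
print((bank_additional["y"].value_counts(normalize=True) * 100).round(2))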
Now that we have a thorough understanding of all of the features present in our dataset, we can move onto more sophisticated visuals involving multiple variables.
The visual below is a heatmap of the correlation matrix of all numerical features in the data. Reddish tones indicate positive correlations, while bluish tones indicate negative correlations. Note that euribor3m, emp.var.rate, and nr.employed are all very strongly correlated with one another, largely because they describe the same thing: the macroeconomic context.
#Getting correlation matrix of numerical variables: corr
corr = bank_additional.corr()
#Plotting correlation matrix heatmap for visualization
_= sns.heatmap(corr, vmax=1, vmin = -1, center=0,\
square=True, linewidths=.5, cbar_kws={"shrink": .5},\
annot=True, fmt='.2f', cmap='coolwarm')
plt.xticks(rotation = 45)
sns.despine()
_.figure.set_size_inches(14,10)
plt.show()
Below is a boxplot of age broken down by job. Most occupations fall into similar age ranges with the exceptions of retirees, who tend to be older, and students, who tend to be younger.
#Plotting boxplot of client age vs job
_= sns.boxplot(x="job", y="age",data=bank_additional)
_= plt.xlabel("Occupation")
_= plt.ylabel("Age of Client")
_= plt.xticks(rotation = 90)
Below is a boxplot of age broken down by default. Those with unknown credit in default are slightly older than those with no record of credit default.
#Plotting boxplot of client age vs default
_= sns.boxplot(x="default", y="age",data=bank_additional)
_= plt.xlabel("Default Status")
_= plt.ylabel("Age of Client")
Below is a stacked proportions chart illustrating the breakdown of y over the values of month. The months of March, September, October, and December have relatively high proportions of clients signing up for the term deposit subscription. It is important to note that those months have the smallest numbers of observations as well, so that could be contributing to these relatively high proportions.
#Creating proportion plot of target variable and month
temp = pd.crosstab(bank_additional["month"], bank_additional["y"])
temp["sum"] = temp.sum(axis=1)
temp.loc[:,"no":"yes"] = temp.loc[:,"no":"yes"].div(temp["sum"], axis=0)
temp = temp.drop(["sum"], axis = 1)
temp = temp.reindex(["mar","apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"])
temp
_= temp.plot(kind = "bar", stacked = True)
_= plt.xticks(rotation = 360)
_= plt.xlabel("Month")
_= plt.ylabel("Proportion")
The proportion plot below shows the breakdown of poutcome over the levels of the target variable, y. You can see that clients who were classed as "success" in previous marketing campaigns make up a higher proportion of those who signed up for the term deposit subscription on this marketing campaign.
#Creating proportion plot of target variable and poutcome
temp = pd.crosstab(bank_additional["y"], bank_additional["poutcome"])
temp["sum"] = temp.sum(axis=1)
temp.loc[:,"failure":"success"] = temp.loc[:,"failure":"success"].div(temp["sum"], axis=0)
temp = temp.drop(["sum"], axis = 1)
_= temp.plot(kind = "bar", stacked = True)
_= plt.xticks(rotation = 360)
_= plt.xlabel("Target Variable")
_= plt.ylabel("Proportion")
As mentioned by the publishers of the data, the duration variable cannot be used for predictive modeling. Someone who either does not pick up the phone (duration = 0 seconds) or hangs up quickly (small duration) will obviously not have signed up for the term deposit (y = "no"). Additionally, it is not possible to know the duration of a call until it is over, at which point the outcome is already known. That said, the boxplot below of call length broken down by the target variable shows that clients in the "yes" group stay on the phone with bank employees longer. Even though this variable cannot be used in the modeling process, the bank can still advise its employees that if they get a client to pick up the phone, they should aim to keep the conversation going for as long as possible.
#Plotting boxplot of call duration vs target variable
_= sns.boxplot(x="y", y="duration",data=bank_additional)
_= plt.xlabel("Target Variable")
_= plt.ylabel("Duration of Last Contact")
The data as-is will not work for modeling. We need to clean and prepare it by encoding and dropping variables, splitting it, and oversampling the training set to overcome the target class imbalance.
full_data = bank_additional.copy()
The first step is to properly encode our target variable in a way that classification algorithms can work with. Here, we create a binary indicator target variable called target in place of the y variable.
#Encoding the target variable (y) properly with numeric 1,0
full_data['target'] = [0 if x =='no' else 1 for x in full_data['y']]
full_data = full_data.drop(["y"], axis= 1)
As mentioned previously, duration cannot be part of the modeling process, so we must drop it from our dataset.
#Drop duration
full_data = full_data.drop(["duration"], axis=1)
The pdays variable is challenging to deal with in its current form because of how missing values are encoded: it represents a numerical quantity, but 999 is used as a placeholder for clients who were never previously contacted. I have elected to replace all 999s in the original pdays column with the mean of the remaining non-999 values. Additionally, I will create a binary indicator, pdays_missing, to denote observations that originally had a 999 for pdays.
#Re-encoding the pdays variable
full_data["pdays_missing"] = [1 if x == 999 else 0 for x in full_data['pdays']]
pdays_mean = np.mean(full_data[full_data.pdays != 999].pdays)
#Replacing the 999 placeholder with the mean of the observed values (pdays column only)
full_data["pdays"] = full_data["pdays"].replace(999, pdays_mean)
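As a quick sanity check (a sketch using the columns created above), no 999s should remain in pdays, and pdays_missing should flag the rows that originally held the placeholder.
#Verifying that the placeholder is gone and the indicator captured the affected rows
print("Remaining 999s in pdays:", (full_data["pdays"] == 999).sum())
print("Rows flagged in pdays_missing:", full_data["pdays_missing"].sum())
print("Imputation value used:", round(pdays_mean, 2))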
Now we need to scale the numerical attributes so that features measured on very different scales do not exert unequal influence on the models. For example, emp.var.rate operates on a scale of ones, while nr.employed is in the thousands.
#Scaling all numerical columns using sklearn Scaler
scaler = StandardScaler()
numerical_vars_list = ["age", "pdays","previous", "emp.var.rate", "cons.price.idx", "cons.conf.idx",\
"euribor3m", "nr.employed"]
#Iterating through and scaling each member of the numerical_vars_list
for var in numerical_vars_list:
temp = scaler.fit_transform(full_data[var].to_frame())
full_data[var] = temp
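A quick check (a sketch) that the scaling behaved as expected: each scaled column should now have a mean of roughly 0 and a standard deviation of roughly 1.
#Confirming that the scaled numerical columns are centered and standardized
print(full_data[numerical_vars_list].describe().loc[["mean", "std"]].round(2))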
Below is a heatmap of all of the existing (now scaled) numerical features and the newly encoded binary features, including the newly encoded target. There are some extremely strong correlations, such as that between euribor3m and emp.var.rate.
#Correlation heatmap for the standardized and enumerated features (including target)
corr = full_data.corr()
_= sns.heatmap(corr, vmax=1, vmin = -1, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True, fmt='.2f', cmap='coolwarm')
plt.xticks(rotation = 45)
sns.despine()
_.figure.set_size_inches(14,10)
plt.show()
Now that we have encoded and scaled our inputs, we have to create dummy variables for the categorical variables.
#Get dummy variables
full_data = pd.get_dummies(full_data)
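The sketch below simply reports the dimensionality after one-hot encoding, which is useful context for the VIF-based pruning in the next step.
#Checking how many columns the dummy encoding produced
print("Rows and columns after get_dummies:", full_data.shape)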
Some of the resulting dummy variables carry largely redundant information, so here I use the calculate_vif_ function (written by Max Ghenis) to remove variables with a variance inflation factor of 5 or higher.
full_data = calculate_vif_(full_data)
Now we split the data into training, validation, and test sets. I opted for 60% training, 20% validation, and 20% test: the first split holds out 20% for test, and the second splits a quarter of the remainder off for validation.
#Splitting data into predictors and labels
X = full_data.loc[:, full_data.columns != 'target']
y = full_data.loc[:, full_data.columns == 'target']
#Splitting into test train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,\
random_state=1)
#Splitting into train validation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25,\
random_state=1)
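The sketch below just confirms that the resulting splits are roughly 60% training, 20% validation, and 20% test.
#Verifying the sizes and proportions of the train, validation, and test splits
n_total = len(full_data)
for name, part in [("train", X_train), ("validation", X_val), ("test", X_test)]:
    print(name, part.shape, round(len(part) / n_total, 2))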
A major concern regarding this classification problem is the imbalance between the majority and minority class. About 90% of clients did not sign up for the term deposit (target = 0), while only 10% did (target = 1). To correct for this, I used synthetic minority oversampling (SMOTE) to balance the training data with a 50-50 split between the 0s and the 1s.
#Synthetic minority oversampling on training data only
sm = SMOTE(random_state=1)
X_train, y_train = sm.fit_resample(X_train, y_train)
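After resampling, the training labels should be split roughly 50-50. The sketch below wraps y_train in a Series before counting, since fit_resample may return either a pandas object or a NumPy array depending on the imbalanced-learn version.
#Checking the class balance of the training labels after SMOTE
print(pd.Series(np.ravel(y_train)).value_counts())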
The code below takes us through the logistic regression model. First, I instantiate the logistic regression estimator (with a higher maximum number of iterations due to convergence concerns) and define the hyperparameters and scoring metric for the grid search. I then use the tuned logistic regression model to predict the validation data and display the performance metrics below. I have also included performance metrics for several decision cutoffs between 0.3 and 0.75, giving decision makers additional flexibility with this particular model.
#Creating an estimator
logistic_model = LogisticRegression(random_state=1,max_iter=1000)
#Creating hyperparameter grid
C = np.logspace(-4, 1, num=5,base=10)
hyperparameters = dict(C=C)
#Creating an F-beta scorer (beta=2) that weights recall more heavily to reduce false negatives
alternate_scorer = make_scorer(fbeta_score, beta=2)
#Conducting the gridsearch with 3 fold cross validation
grid = GridSearchCV(logistic_model, hyperparameters, cv=3, verbose=0, scoring=alternate_scorer)
best_logistic_model = grid.fit(X_train, np.ravel(y_train))
#print('Best C:', best_logistic_model.best_estimator_.get_params()['C'])
#Getting class predictions from the tuned logistic regression model
y_scores_logistic = best_logistic_model.predict(X_val)
#Output confusion matrix and score report for logistic regression
print(pd.DataFrame(metrics.confusion_matrix(y_val, y_scores_logistic),\
columns=['pred_neg', 'pred_pos'],\
index=['neg', 'pos']))
print(metrics.classification_report(y_val, y_scores_logistic))
Below is the output for different decision thresholds. Using this output, the bank can decide what balance it wants to strike between calling too many people who will say no and minimizing losses from people who would have said yes had they been called.
y_logprobs = best_logistic_model.predict_proba(X_val)[:,1]
#Using threshold_optimization function (check Functions section) to generate report of probability cutoffs
#and their associated metrics
threshold_optimization(y_probas=y_logprobs, y_real=y_val,\
decision_thresholds = [x/100.0 for x in range(30,80,5)])
In the code below, I get the logistic regression coefficients from the model. These coefficients have the benefit of interpretability, unlike the other models in this notebook.
coefficients = pd.concat([pd.DataFrame(X_train.columns),\
pd.DataFrame(np.transpose(best_logistic_model.best_estimator_.coef_))],\
axis = 1)
coefficients.columns = ["Feature", "Coefficient"]
coefficients.head()
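Because the inputs were standardized, the coefficients are on comparable scales, and exponentiating them gives approximate odds ratios that can be easier to communicate. This is a small optional sketch building on the coefficients frame above.
#Converting the logistic regression coefficients to odds ratios for easier interpretation
coefficients["Odds Ratio"] = np.exp(coefficients["Coefficient"])
coefficients.sort_values("Odds Ratio", ascending=False).head()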
The code below takes us through the XGBoost model. Much like the logistic regression code, I define the model, the candidate hyperparameters, and the scoring method for the grid search, run the tuned model on the validation data, and output the performance metrics below. Note that the base_score values defined below are not included in the final hyperparameter grid; as with the logistic regression model, the predicted probabilities could be run through threshold_optimization if a different decision cutoff were needed.
xgboost_model = xgb.XGBClassifier(random_state=1)
#Creating hyperparameter grid
base_score = [x/10.0 for x in range(3,7,1)]
n_estimators = [10,100,1000]
max_depth = range(1,10,3)
learning_rate = [0.01, 0.1]
min_child_weight = [1,2]
hyperparameters = dict(n_estimators=n_estimators, max_depth = max_depth,\
learning_rate = learning_rate, min_child_weight = min_child_weight)
#Creating an F-beta scorer (beta=2) that weights recall more heavily to reduce false negatives
alternate_scorer = make_scorer(fbeta_score, beta=2)
#Creating grid object
grid = GridSearchCV(xgboost_model, hyperparameters, cv=3, verbose=0, scoring= alternate_scorer)
#Fitting grid object xgboost model with training data to select best model
best_xgboost_model = grid.fit(X_train, np.ravel(y_train))
#Getting class predictions from the tuned XGBoost model
y_scores_xgboost = best_xgboost_model.predict(X_val)
#Output confusion matrix and score report for xgboost
print(pd.DataFrame(metrics.confusion_matrix(y_val, y_scores_xgboost),\
columns=['pred_neg', 'pred_pos'],\
index=['neg', 'pos']))
print(metrics.classification_report(y_val, y_scores_xgboost))
The code below takes us through the random forest model. First I instantiate the classifier, then specify hyperparameters and conduct a gridsearch. I run the model on the validation data and output that performance below.
random_forest_model = RandomForestClassifier(random_state=1)
#Creating hyperparameter grid
n_estimators = [10,100,1000]
max_features = range(1,20,3)
max_depth = [10,100]
min_samples_split = [2,5,10]
min_samples_leaf = [1,2,4]
bootstrap = [True, False]
hyperparameters = dict(n_estimators=n_estimators, max_features = max_features,\
max_depth = max_depth, min_samples_split = min_samples_split,\
min_samples_leaf = min_samples_leaf, bootstrap = bootstrap)
#Creating an F-beta scorer (beta=2) that weights recall more heavily to reduce false negatives
alternate_scorer = make_scorer(fbeta_score, beta=2)
#Creating grid object
grid = GridSearchCV(random_forest_model, hyperparameters, cv=3, verbose=0,\
scoring= alternate_scorer)
#Fitting the random forest grid object with training data to select the best model
best_rf_model = grid.fit(X_train, np.ravel(y_train))
#Getting class predictions from the tuned random forest model
y_scores_rf = best_rf_model.predict(X_val)
print(pd.DataFrame(metrics.confusion_matrix(y_val, y_scores_rf),\
columns=['pred_neg', 'pred_pos'],\
index=['neg', 'pos']))
print(metrics.classification_report(y_val, y_scores_rf))
The code below takes us through the support vector machine model. Like the other models, I instantiate the classifier, specify hyperparameters, run a grid search, predict on the validation data, and display the results. SVMs are often favored for data with a very large number of features relative to the number of observations, but I thought I would give it a try just to see how it compares to the other models.
support_vector_machine_model = SVC(random_state=1, probability=True)
#Creating hyperparameter grid
C = np.logspace(-4, 1, num=5,base=10)
kernel = ["rbf", "poly"]
gamma = ["scale", "auto"]
hyperparameters = dict(C = C, kernel = kernel, gamma = gamma)
#Creating an F-beta scorer (beta=2) that weights recall more heavily to reduce false negatives
alternate_scorer = make_scorer(fbeta_score, beta=2)
#Creating grid object
grid = GridSearchCV(support_vector_machine_model, hyperparameters, cv=3, verbose=0,\
scoring= alternate_scorer)
#Fitting grid object svm model with training data to select best model
best_svm_model = grid.fit(X_train, np.ravel(y_train))
#Getting class predictions from the tuned support vector machine model
y_scores_svm = best_svm_model.predict(X_val)
print(pd.DataFrame(metrics.confusion_matrix(y_val, y_scores_svm),\
columns=['pred_neg', 'pred_pos'],\
index=['neg', 'pos']))
print(metrics.classification_report(y_val, y_scores_svm))
Below we compare and select the models based on the area under the Precision Recall Curve (PRC). I recommend using the AUC of the PRC for this dataset because of the class imbalance present in the target variable. We examine the performance on test data and training data. These comparisons will tell us a few things: which model is best for the problem, how well they will likely perform on unseen data, and how much the models overfit the training data.
#Logistic regression predictions and probabilities on test data and train data
test_probs_logistic = best_logistic_model.predict_proba(X_test)[:,1]
train_probs_logistic = best_logistic_model.predict_proba(X_train)[:,1]
#XG Boost predictions and probabilities on test data and train data
test_probs_xgboost = best_xgboost_model.predict_proba(X_test)[:,1]
train_probs_xgboost = best_xgboost_model.predict_proba(X_train)[:,1]
#Random Forest predictions and probabilities on test data and train data
test_probs_rf = best_rf_model.predict_proba(X_test)[:,1]
train_probs_rf = best_rf_model.predict_proba(X_train)[:,1]
#Support Vector Machine predictions and probabilities on test data and train data
test_probs_svm = best_svm_model.predict_proba(X_test)[:,1]
train_probs_svm = best_svm_model.predict_proba(X_train)[:,1]
#Getting the precision and recall for each model on the test data
prec_lr, rec_lr, _= metrics.precision_recall_curve(y_true=y_test,\
probas_pred= test_probs_logistic)
prec_xgb, rec_xgb, _= metrics.precision_recall_curve(y_true=y_test,\
probas_pred=test_probs_xgboost)
prec_rf, rec_rf, _= metrics.precision_recall_curve(y_true=y_test,\
probas_pred= test_probs_rf)
prec_svm, rec_svm, _= metrics.precision_recall_curve(y_true=y_test,\
probas_pred=test_probs_svm)
#Getting the precision and recall for each model on the train data
prec_lr_tr, rec_lr_tr, _= metrics.precision_recall_curve(y_true=y_train,\
probas_pred= train_probs_logistic)
prec_xgb_tr, rec_xgb_tr, _= metrics.precision_recall_curve(y_true=y_train,\
probas_pred=train_probs_xgboost)
prec_rf_tr, rec_rf_tr, _= metrics.precision_recall_curve(y_true=y_train,\
probas_pred= train_probs_rf)
prec_svm_tr, rec_svm_tr, _= metrics.precision_recall_curve(y_true=y_train,\
probas_pred=train_probs_svm)
The plot below shows each model's precision-recall curve on the test data. Based on its area under the curve of 0.355, the XGBoost model is the best of the four.
_=plt.plot(rec_lr, prec_lr, color='C0')
_=plt.plot(rec_xgb, prec_xgb, color= 'C1')
_=plt.plot(rec_rf, prec_rf, color = 'C2')
_=plt.plot(rec_svm, prec_svm, color = 'C3')
#Tweaking some visualization characteristics (axis limits, labels, title)
_= plt.xlim([-0.01, 1.0])
_= plt.ylim([0.0, 1.05])
#_= plt.plot([0, 1], [0, 1], 'k--')
_= plt.xlabel('Recall')
_= plt.ylabel('Precision')
_= plt.title('Precision Recall Curve: Test Data')
#This is all legend formatting
blue_patch = mpatches.Patch(color='C0',\
label=('Logistic Regression: '\
+str(round(metrics.auc(rec_lr,prec_lr),ndigits=3))))
orange_patch = mpatches.Patch(color='C1',\
label=('XG Boost: '\
+str(round(metrics.auc(rec_xgb,prec_xgb),ndigits=3))))
green_patch = mpatches.Patch(color='C2',\
label=('Random Forest: '\
+str(round(metrics.auc(rec_rf,prec_rf),ndigits=3))))
red_patch = mpatches.Patch(color='C3',\
label=('Support Vector Machine: '\
+str(round(metrics.auc(rec_svm,prec_svm,),ndigits=3))))
plt.legend(handles=[blue_patch, orange_patch,green_patch, red_patch])
#Changing the figure size from the default
plt.rcParams["figure.figsize"] = [8,10]
plt.show()
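As a cross-check on the legend values above, average precision is a closely related single-number summary of the same curves; the sketch below computes it for each model on the test probabilities calculated earlier.
#Average precision on the test data as a cross-check of the PRC AUC values in the legend
for name, probs in [("Logistic Regression", test_probs_logistic),
                    ("XG Boost", test_probs_xgboost),
                    ("Random Forest", test_probs_rf),
                    ("Support Vector Machine", test_probs_svm)]:
    print(name + ":", round(metrics.average_precision_score(np.ravel(y_test), probs), 3))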
The plot below shows the performance of each model on the training data. Based on the disparity between the test and training PRC performance, it is evident that the models overfit the training data and did not generalize as well to the test set. Part of the gap also reflects that the training set was balanced with SMOTE, which raises the baseline precision relative to the imbalanced test set.
_=plt.plot(rec_lr_tr, prec_lr_tr, color='C0')
_=plt.plot(rec_xgb_tr, prec_xgb_tr, color= 'C1')
_=plt.plot(rec_rf_tr, prec_rf_tr, color = 'C2')
_=plt.plot(rec_svm_tr, prec_svm_tr, color = 'C3')
#Tweaking some visualization characteristics (axis limits, labels, title)
_= plt.xlim([-0.01, 1.01])
_= plt.ylim([0.0, 1.05])
#_= plt.plot([0, 1], [0, 1], 'k--')
_= plt.xlabel('Recall')
_= plt.ylabel('Precision')
_= plt.title('Precision Recall Curve: Training Data')
#This is all legend formatting
blue_patch = mpatches.Patch(color='C0',\
label=('Logistic Regression: '\
+str(round(metrics.auc(rec_lr_tr,prec_lr_tr),ndigits=3))))
orange_patch = mpatches.Patch(color='C1',\
label=('XG Boost: '\
+str(round(metrics.auc(rec_xgb_tr,prec_xgb_tr),ndigits=3))))
green_patch = mpatches.Patch(color='C2',\
label=('Random Forest: '\
+str(round(metrics.auc(rec_rf_tr,prec_rf_tr),ndigits=3))))
red_patch = mpatches.Patch(color='C3',\
label=('Support Vector Machine: '\
+str(round(metrics.auc(rec_svm_tr,prec_svm_tr),ndigits=3))))
plt.legend(handles=[blue_patch, orange_patch,green_patch, red_patch])
#Changing the figure size from the default
plt.rcParams["figure.figsize"] = [8,10]
plt.show()
The table below shows the top five feature importances for the XGBoost model and their corresponding logistic regression coefficients for some added interpretability.
#Creating a dataframe containing the features and their XGBoost feature importances
fi = pd.concat([pd.DataFrame(X_train.columns),\
pd.DataFrame(np.transpose(best_xgboost_model.best_estimator_.feature_importances_))],\
axis = 1)
fi.columns = ["Feature","XGB Importance"]
#Creating a dataframe with a new column containing the logistic regression coefficient
result = pd.merge(fi,
coefficients["Coefficient"],
on=fi.Feature,
how='inner')
result = result.drop("key_0", axis = 1)
most_important_xgb = result.sort_values("XGB Importance", ascending = False)
most_important_xgb.head()
Suggestions for future modeling improvements:
1) Reconsider sampling strategy. SMOTE oversampling may have introduced additional noise that the models are overfitting to. Undersampling may be a viable alternative to try and reduce overfitting.
2) Reconsider hyperparameters, especially the depth parameters for XGBoost and Random Forest and the regularization strength for Logistic Regression and SVM. Reducing the maximum depth or the C parameter (the inverse regularization strength) would introduce bias and reduce the variance of the models, thereby reducing overfitting.
3) Implement feature selection/feature engineering. Working with domain experts to categorize features into related business factors would potentially eliminate excessive variables from the modeling process. Additionally, input from those stakeholders could inform new features that better concentrate the signal from the data.
Here are the final recommendations to the bank from this analysis:
1) The bank should inform its employees to keep the conversation going for as long as possible. While it could not be used in the modeling process, the impact of duration cannot be ignored. Clients who subscribed to the term deposit generally stayed on the phone longer with bank employees.
2) Use the XGBoost model to predict client response and, in tandem with the logistic regression coefficients, identify the factors that matter most to guide strategy. Here is an example using the top variable (nr.employed) and its corresponding logistic regression coefficient:
The XGBoost model found nr.employed to be the most important feature in determining whether or not a client will subscribe to a term deposit. Looking at the corresponding logistic regression coefficient, a higher value of the employment indicator (i.e., a stronger overall labor market) is associated with a lower chance that a client will subscribe to the term deposit through direct marketing, holding all else equal. In other words, the campaign tends to convert better under weaker macroeconomic conditions, so the bank should factor the economic climate into when and how aggressively it runs this campaign.