Sameen Salam- M.S. Candidate, Institute for Advanced Analytics at North Carolina State University
Email: ssalam@ncsu.edu
In this notebook, I will take you through an analysis of banking data from a financial institution in Portugal. The data is from the UC Irvine Machine Learning Repository and consists of several datasets containing client, bank, and macroeconomic information from May 2008 to November 2010.
Here is an overview on how this notebook is structured:
Introduction
Understanding the Problem
Initial Setup
Functions
Data Acquisition
Exploratory Data Analysis
Preliminary Data Check
Variables
General Statistics
Modeling
Data Pre-Processing
Logistic Regression
XGBoost
Random Forest
Support Vector Machine
The financial institution seeks to predict whether or not a client will subscribe to a term deposit product via a direct marketing campaign. Using the data provided, I will create models to accurately predict client behavior and explain the factors that drive a successful marketing campaign in relation to the term deposit subscription. From there, the bank can use said models to predict outcomes and adjust their marketing strategy.
The UC Irvine repository contains four different banking datasets collected from a Portuguese banking institution. Two of the datasets (labeled "additional" when imported) are newer and provide extra features related to the external macroeconomic context of each client interaction, while the other two (unlabeled) are older and do not include that extra information. I chose to move forward with the newer, more informative datasets.
The remaining datasets contained the exact same features, but one was ~41,000 rows, while the other was a ~4,100 row random subset of the larger set. I opted for the smaller dataset because it provides enough observations for modeling while also reducing overall computational time.
#Importing the necessary libraries
import requests
import io
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels
import numpy as np
import seaborn as sns
import xgboost as xgb
import matplotlib.patches as mpatches
from zipfile import ZipFile
from statsmodels.distributions.empirical_distribution import ECDF
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn import metrics
from imblearn.over_sampling import SMOTE
from sklearn.metrics import fbeta_score, make_scorer
from statsmodels.stats.outliers_influence import variance_inflation_factor
from IPython.display import display, HTML
These are additional functions I created or borrowed for the purposes of this analysis.
#Creating a function to download, read, and extract zip file contents: data_reader
#Note that the zip is downloaded into memory, written to the same directory as this notebook, and extracted there
def data_reader(url, output_file_name):
'''Download and extract zip files into local script directory.
Keyword Arguments:
url -- string representing url path or location
output_file_name -- file name under which the downloaded zip will be saved (string)
Note: The zip contents are extracted into the current working directory; any folders inside the zip are preserved as subdirectories.
'''
my_file = requests.get(url)
output = open(output_file_name, "wb")
output.write(my_file.content)
output.close()
with ZipFile(output_file_name, "r") as zip:
zip.printdir()
zip.extractall()
#Creating a convenient function to plot empirical cumulative distribution plots for continuous
#variables
def ecdf_plot(data, variable, x_lab, display_lines = True):
'''Plot empirical cumulative distribution function of a numerical variable
and plot mean (red) and median (green) lines
Keyword Arguments:
data -- pandas Dataframe
variable -- column name (string) **must be a numerical input**
x_lab -- X axis label (string)
display_lines -- True or False
'''
sns.set()
ecdf = ECDF(data[variable])
_= plt.plot(ecdf.x, ecdf.y, marker = ".", linestyle = "none", alpha = 0.2)
_= plt.xlabel(x_lab)
_= plt.ylabel("Cumulative Density")
if (display_lines == True):
_= plt.vlines(np.mean(data[variable]), ymax=1,ymin=0,colors='r')
_= plt.vlines(np.median(data[variable]), ymax = 1, ymin = 0, colors = 'g')
#Creating a convenient function to plot my bar graphs in a consistent way
def mybarplot(data, variable, x_lab, x_tick_rotation = 360, reorder = False, correct_index = 0):
'''Plot standardized bar graphs
Keyword Arguments:
data -- pandas Dataframe
variable -- column name (string) **Must be a categorical input**
x_lab -- X axis label (string)
x_tick_rotation -- rotation of the x-axis labels (numeric)
reorder -- Whether or not you want to reorder the bars on the graph (bool)
correct_index -- A list of strings representing the levels of the categorical variable
(string or list of strings)
'''
sns.set()
temp = data[variable].value_counts().to_frame()
if (reorder == True):
temp = temp.reindex(correct_index)
temp = temp.reset_index()
_= temp.plot.bar(x = "index", y = variable, color = "mediumseagreen", edgecolor = "black")
_= plt.ylabel("Count")
_= plt.xlabel(x_lab)
_= plt.xticks(rotation = x_tick_rotation)
#Decision cutoff report generator
def threshold_optimization(y_probas, y_real, \
decision_thresholds = [x / 10.0 for x in range(1, 10, 1)]):
'''Report confusion matrix and model metrics for probability decision cutoffs
Keyword Arguments:
y_probas -- list of predicted probabilities for the positive class (np array)
y_real -- true class labels of the validation or test data (dataframe)
decision_thresholds -- positive class probability at which the model will label an observation
a 1 (numeric or list of numeric)
'''
class_labels = pd.DataFrame(index=decision_thresholds, columns=range(len(y_probas)))
for thresh in decision_thresholds:
temp = [1 if x >= thresh else 0 for x in y_probas]
class_labels.loc[thresh,:] = temp
for i in range(len(class_labels)):
print("-----------------------------Decision Cutoff:",class_labels.index[i])
print(pd.DataFrame(
metrics.confusion_matrix(y_real, np.int64(class_labels.iloc[i,:])),
columns=['pred_neg', 'pred_pos'],
index=['neg', 'pos']))
print(metrics.classification_report(y_real,np.int64(class_labels.iloc[i,:])))
#Written by Stackoverflow user Max Ghenis
def calculate_vif_(X, thresh=5.0):
variables = list(range(X.shape[1]))
dropped = True
while dropped:
dropped = False
vif = [variance_inflation_factor(X.iloc[:, variables].values, ix)
for ix in range(X.iloc[:, variables].shape[1])]
maxloc = vif.index(max(vif))
if max(vif) > thresh:
print('dropping \'' + X.iloc[:, variables].columns[maxloc] +
'\' at index: ' + str(maxloc))
del variables[maxloc]
dropped = True
#print('Remaining variables:')
#print(X.columns[variables])
return X.iloc[:, variables]
#Getting the url from the repository for the additional datasets
url_additional = "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip"
#Using the data reader function to get the additional bank data and load it onto the same
#directory as this notebook
data_reader(url= url_additional, output_file_name="bank_additional.zip")
#Reading in the smaller of the two "additional" datasets: bank_additional
bank_additional = pd.read_csv("bank-additional/bank-additional.csv", delimiter = ";")
Taking a quick look at the data.
#Get the quick look at the data
bank_additional.head()
#Getting rid of any duplicates
bank_additional = bank_additional.drop_duplicates(keep="first")
#Getting dataset information
bank_additional.info()
#Number of unique values for each variable
bank_additional.nunique()
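One thing to keep in mind is that this dataset encodes missing categorical values as the literal string "unknown" rather than as nulls, so a standard null check will come back clean. The quick sketch below counts both, using the bank_additional frame loaded above.
#Counting conventional nulls and "unknown" placeholder values in each column
print(bank_additional.isnull().sum())
print((bank_additional == "unknown").sum())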
Before exploring relationships between different features of the data, let's take a look at them individually. Our goal here is to get an understanding of center and spread for numerical variables, and class distribution for categorical variables.
#Getting descriptives for each numeric variable
bank_additional.describe()
The plot below shows the empirical cumulative density function (ECDF) plot for the age variable. The mean (40.1 years) is indicated by the red vertical line and the median (38 years) by the green vertical line. From this plot, it is apparent that age is right skewed. You can also see that ~80% of customers are between 30 and 60 years of age, and that ~40% of customers are between 30 and 40 years of age.
#Calling the ecdf_plot function to plot empirical cumulative distribution function
ecdf_plot(data=bank_additional, variable="age", x_lab="Age of Client")
Another look at age, this time using a histogram with an overlaid density curve. Here we can see the distribution is right skewed and somewhat bell-shaped.
#Density plot and histogram of age
_= sns.distplot(bank_additional["age"],hist=True,color='orange')
_= plt.xlabel("Age of Client")
_= plt.ylabel("Density")
The plot below is the ECDF plot for duration. You can see that in almost all cases of contact with customers, duration remains below 1000 seconds. Additionally, ~90% of calls last less than 500 seconds. The mean duration is 256.8 seconds and the median is 181 seconds.
ecdf_plot(data=bank_additional, variable="duration", x_lab="Duration of Last Contact (seconds)")
Below is the ECDF plot for campaign. Based on the figure, campaign is right skewed. Just over 40% of customers were contacted once for the current marketing campaign, while ~90% were contacted five times or fewer.
ecdf_plot(data=bank_additional, variable="campaign",\
x_lab="Number of Contact Attempts",display_lines=False)
Based on the ECDF plot below of pdays, it is evident that the vast majority of observations (over 95%) represent clients with no previous contact, indicated by the placeholder value of 999 days since last contact.
ecdf_plot(data=bank_additional, variable="pdays",\
x_lab="Number of Days Since Last Contact",display_lines=False)
The density histogram below zooms in on the small cluster visible in the bottom-left corner of the previous plot: the distribution of pdays for customers who had actually been contacted in the past (i.e., non-999 values). The distribution appears to be bimodal.
bank_additional_realpdays = bank_additional.copy()
bank_additional_realpdays = bank_additional[bank_additional.pdays != 999]
_= sns.distplot(bank_additional_realpdays["pdays"],hist=True, kde=True, color='orange')
_= plt.xlabel("Number of Days Since Last Contact")
_= plt.ylabel("Density")
The plot below shows the ECDF of previous. The majority of clients have not been contacted for previous campaigns.
ecdf_plot(data=bank_additional, variable="previous",\
x_lab="Number of Previous Contact Attempts",display_lines=False)
The following ECDF plot of emp.var.rate shows that the rate takes on a few values, with most density lying at around -1.8 and above 1.
ecdf_plot(data=bank_additional, variable="emp.var.rate",\
x_lab="Employment Variation Rate", display_lines=False)
The following ECDF plot of cons.price.idx shows little variation. The Consumer Price Index stays between ~92 and ~95 across the entire dataset, so this variable may not prove very informative.
ecdf_plot(data=bank_additional, variable="cons.price.idx",\
x_lab="Consumer Price Index Value",display_lines=False)
The following ECDF plot of cons.conf.idx shows that most observations are clustered around -42 and -37, indicating a generally negative view of the economy.
ecdf_plot(data=bank_additional, variable="cons.conf.idx",\
x_lab="Consumer Confidence Index Value",display_lines=False)
The following ECDF plot of euribor3m shows that the rate is strongly bimodal, with values concentrated around 1-1.5% and ~5%.
ecdf_plot(data=bank_additional, variable="euribor3m",\
x_lab="Euribor 3 Month Rate", display_lines=False)
The following ECDF plot of nr.employed shows that ~40% of client contacts occurred when the quarterly employment indicator was above 5,200.
ecdf_plot(data=bank_additional, variable="nr.employed",\
x_lab="Number of Employees",display_lines=False)
The following plot is a bar graph of the frequencies of the different values of poutcome (nonexistent, failure, success). Unsurprisingly, nonexistent dominates this attribute, as most customers in the dataset are considered new customers. This is consistent with 0 being dominant in previous (a customer previously contacted zero times) and 999 being dominant in pdays (a customer who has never been contacted before). These new customers have never had the option to accept or decline a product.
#Calling the mybarplot function to graph poutcome
mybarplot(data=bank_additional, variable="poutcome",\
x_lab="Outcome of Previous Marketing Campaign", reorder=True,\
correct_index=["failure", "success", "nonexistent"])
The following bar graph shows how job is broken down across the dataset. Administrative, blue-collar, and technician occupations are the most common in the data.
mybarplot(data=bank_additional, variable="job", x_lab="Occupation", x_tick_rotation=90)
The following bar graph shows how marital is broken down across the dataset. Married clients make up over half of the observations.
mybarplot(data=bank_additional, variable="marital", x_lab="Marital Status")
The following bar graph shows the frequencies of different levels of education. The most common educational attainment is a university degree with over 1200 observations, but it does not make up a majority of clients.
mybarplot(data=bank_additional, variable="education", x_lab="Educational Attainment",\
x_tick_rotation=90)
The bar graph of default shown below illustrates that most customers do not have any credit in default. A significant minority of the observations, however, are listed as unknown, which may prove to be interesting with further analysis.
mybarplot(data=bank_additional, variable="default", x_lab="Credit Default Status", reorder=True,\
correct_index=["no", "yes", "unknown"])
The bar graph of housing shown below indicates that clients are somewhat evenly split between having or not having a housing loan.
mybarplot(data=bank_additional, variable="housing", \
x_lab="Housing Loan Possession", reorder=True,\
correct_index=["no", "yes", "unknown"])
The bar graph below of loan shows that most customers do not have a personal loan.
mybarplot(data=bank_additional, variable="loan", x_lab="Personal Loan Possession")
The bar graph below of contact shows that more clients are contacted via cell phone, but a strong minority of clients stick to telephone interactions.
mybarplot(data=bank_additional, variable="contact", x_lab="Contact Medium")
The bar graph of day_of_week, shown below, indicates an even spread of client contact over the course of the work week.
mybarplot(data=bank_additional, variable="day_of_week",\
x_lab="Day of Week of Last Contact",reorder=True,\
correct_index=["mon","tue","wed", "thu", "fri"])
The bar graph below shows the breakdown of month. Only 10 months appear because no clients were last contacted in January or February. Additionally, March, September, October, and December have relatively few observations. May has significantly more records of client contact than any other month.
mybarplot(data=bank_additional, variable="month", x_lab="Month of Last Contact", reorder=True,\
correct_index=["mar","apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"])
The barplot below shows the breakdown of y, our target variable. We see that 3,668 clients did not sign up for the term deposit subscription and 451 did (89.05%-10.95% split).
#Bar plot of the target variable y
mybarplot(data=bank_additional, variable="y", x_lab="Target Variable")
#Table with frequencies and percentages of levels of y: new_table
new_table = bank_additional["y"].value_counts().to_frame()
new_table["y_percentage"] = [round(float(new_table.loc["no"]/sum(new_table["y"])),4)*100, \
round(float(new_table.loc["yes"]/sum(new_table["y"])),4)*100]
new_table
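For a quick cross-check of the table above, value_counts with normalize=True produces the same split in a couple of lines; this is just a convenience sketch using the same bank_additional frame.
#Cross-checking the class split with normalized value counts
print(bank_additional["y"].value_counts())
print((bank_additional["y"].value_counts(normalize=True) * 100).round(2))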
Now that we have a thorough understanding of all of the features present in our dataset, we can move onto more sophisticated visuals involving multiple variables.
The visual below is a heatmap of the correlation matrix of all numerical features in the data. Reddish tones indicate positive correlations, while bluish tones indicate negative correlations. Note that euribor3m, emp.var.rate, and nr.employed are all very strongly correlated with one another, largely because they describe the same thing: the macroeconomic context.
#Getting correlation matrix of numerical variables: corr
corr = bank_additional.corr()
#Plotting correlation matrix heatmap for visualization
_= sns.heatmap(corr, vmax=1, vmin = -1, center=0,\
square=True, linewidths=.5, cbar_kws={"shrink": .5},\
annot=True, fmt='.2f', cmap='coolwarm')
plt.xticks(rotation = 45)
sns.despine()
_.figure.set_size_inches(14,10)
plt.show()
Below is a boxplot of age broken down by job. Most occupations fall into similar age ranges with the exceptions of retirees, who tend to be older, and students, who tend to be younger.
#Plotting boxplot of client age vs job
_= sns.boxplot(x="job", y="age",data=bank_additional)
_= plt.xlabel("Occupation")
_= plt.ylabel("Age of Client")
_= plt.xticks(rotation = 90)
Below is a boxplot of age broken down by default. Those with unknown credit in default are slightly older than those with no record of credit default.
#Plotting boxplot of client age vs default
_= sns.boxplot(x="default", y="age",data=bank_additional)
_= plt.xlabel("Default Status")
_= plt.ylabel("Age of Client")
Below is a stacked proportions chart illustrating the breakdown of y over the values of month. The months of March, September, October, and December have relatively high proportions of clients signing up for the term deposit subscription. It is important to note that those months have the smallest numbers of observations as well, so that could be contributing to these relatively high proportions.
#Creating proportion plot of target variable and month
temp = pd.crosstab(bank_additional["month"], bank_additional["y"])
temp["sum"] = temp.sum(axis=1)
temp.loc[:,"no":"yes"] = temp.loc[:,"no":"yes"].div(temp["sum"], axis=0)
temp = temp.drop(["sum"], axis = 1)
temp = temp.reindex(["mar","apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"])
temp
_= temp.plot(kind = "bar", stacked = True)
_= plt.xticks(rotation = 360)
_= plt.xlabel("Month")
_= plt.ylabel("Proportion")
The proportion plot below shows the breakdown of poutcome over the levels of the target variable, y. You can see that clients who were classed as "success" in previous marketing campaigns make up a higher proportion of those who signed up for the term deposit subscription on this marketing campaign.
#Creating proportion plot of target variable and poutcome
temp = pd.crosstab(bank_additional["y"], bank_additional["poutcome"])
temp["sum"] = temp.sum(axis=1)
temp.loc[:,"failure":"success"] = temp.loc[:,"failure":"success"].div(temp["sum"], axis=0)
temp = temp.drop(["sum"], axis = 1)
_= temp.plot(kind = "bar", stacked = True)
_= plt.xticks(rotation = 360)
_= plt.xlabel("Target Variable")
_= plt.ylabel("Proportion")
As mentioned by the publishers of the data, the duration variable cannot be used for predictive modeling. Someone who either does not pick up the phone (duration = 0 seconds) or hangs up quickly (small duration) will obviously not have signed up for the term deposit (y = "no"). Additionally, it is not possible to know the duration of a call until it is over, at which point the outcome is already known. That said, the boxplot below of call length broken down by the target variable shows that clients in the "yes" group stay on the phone with bank employees longer. Even though this variable cannot be used in the modeling process, the bank can still advise its employees that if they get a client to pick up the phone, they should aim to keep the conversation going for as long as possible.
#Plotting boxplot of call duration vs target variable
_= sns.boxplot(x="y", y="duration",data=bank_additional)
_= plt.xlabel("Target Variable")
_= plt.ylabel("Duration of Last Contact")
The data as-is will not work for modeling. We need to clean and prepare it by encoding and dropping variables, splitting it, and oversampling the training set to overcome the target class imbalance.
full_data = bank_additional.copy()
The first step is to properly encode our target variable in a way that classification algorithms can work with. Here, we create a binary indicator target variable called target in place of the y variable.
#Encoding the target variable (y) properly with numeric 1,0
full_data['target'] = [0 if x =='no' else 1 for x in full_data['y']]
full_data = full_data.drop(["y"], axis= 1)
As mentioned previously, duration cannot be part of the modeling process, so we must drop it from our dataset.
#Drop duration
full_data = full_data.drop(["duration"], axis=1)
The pdays variable is challenging to deal with in its current form because of how missing values are encoded: it represents a numerical quantity, but 999 is used as a placeholder for clients who were never previously contacted. I have elected to replace all 999s in the original pdays column with the mean of the remaining non-999 values. Additionally, I will create a binary indicator, pdays_missing, to denote observations that originally had a 999 for pdays.
#Re-encoding the pdays variable
full_data["pdays_missing"] = [1 if x == 999 else 0 for x in full_data['pdays']]
pdays_mean = np.mean(full_data[full_data.pdays != 999].pdays)
#Replacing the 999 placeholder with the mean of the observed values (pdays column only)
full_data["pdays"] = full_data["pdays"].replace(999, pdays_mean)
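As a quick sanity check (a sketch using the columns created above), no 999s should remain in pdays, and pdays_missing should flag the rows that originally held the placeholder.
#Verifying that the placeholder is gone and the indicator captured the affected rows
print("Remaining 999s in pdays:", (full_data["pdays"] == 999).sum())
print("Rows flagged in pdays_missing:", full_data["pdays_missing"].sum())
print("Imputation value used:", round(pdays_mean, 2))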
Now we need to scale the numerical attributes so that features measured on very different scales do not exert unequal influence on the models. For example, emp.var.rate operates on a scale of ones, while nr.employed is in the thousands.
#Scaling all numerical columns using sklearn Scaler
scaler = StandardScaler()
numerical_vars_list = ["age", "pdays","previous", "emp.var.rate", "cons.price.idx", "cons.conf.idx",\
"euribor3m", "nr.employed"]
#Iterating through and scaling each member of the numerical_vars_list
for var in numerical_vars_list:
temp = scaler.fit_transform(full_data[var].to_frame())
full_data[var] = temp
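A quick check (a sketch) that the scaling behaved as expected: each scaled column should now have a mean of roughly 0 and a standard deviation of roughly 1.
#Confirming that the scaled numerical columns are centered and standardized
print(full_data[numerical_vars_list].describe().loc[["mean", "std"]].round(2))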
Below is a heatmap of all of the existing (now scaled) numerical features and the newly encoded binary features, including the newly encoded target. There are some extremely strong correlations, such as that between euribor3m and emp.var.rate.
#Correlation heatmap for the standardized and enumerated features (including target)
corr = full_data.corr()
_= sns.heatmap(corr, vmax=1, vmin = -1, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True, fmt='.2f', cmap='coolwarm')
plt.xticks(rotation = 45)
sns.despine()
_.figure.set_size_inches(14,10)
plt.show()
Now that we have encoded and scaled our inputs, we have to create dummy variables for the categorical variables.
#Get dummy variables
full_data = pd.get_dummies(full_data)
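The sketch below simply reports the dimensionality after one-hot encoding, which is useful context for the VIF-based pruning in the next step.
#Checking how many columns the dummy encoding produced
print("Rows and columns after get_dummies:", full_data.shape)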
Some of the resulting dummy variables carry largely redundant information, so here I use the calculate_vif_ function (written by Max Ghenis) to remove variables with a variance inflation factor of 5 or higher.
full_data = calculate_vif_(full_data)
Now we split the data into training, validation, and test sets. I opted for 60% training, 20% validation, and 20% test: the first split holds out 20% for test, and the second splits a quarter of the remainder off for validation.
#Splitting data into predictors and labels
X = full_data.loc[:, full_data.columns != 'target']
y = full_data.loc[:, full_data.columns == 'target']
#Splitting into test train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,\
random_state=1)
#Splitting into train validation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25,\
random_state=1)
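The sketch below just confirms that the resulting splits are roughly 60% training, 20% validation, and 20% test.
#Verifying the sizes and proportions of the train, validation, and test splits
n_total = len(full_data)
for name, part in [("train", X_train), ("validation", X_val), ("test", X_test)]:
    print(name, part.shape, round(len(part) / n_total, 2))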
A major concern regarding this classification problem is the imbalance between the majority and minority class. About 90% of clients did not sign up for the term deposit (target = 0), while only 10% did (target = 1). To correct for this, I used synthetic minority oversampling (SMOTE) to balance the training data with a 50-50 split between the 0s and the 1s.
#Synthetic minority oversampling on training data only
sm = SMOTE(random_state=1)
X_train, y_train = sm.fit_resample(X_train, y_train)
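After resampling, the training labels should be split roughly 50-50. The sketch below wraps y_train in a Series before counting, since fit_resample may return either a pandas object or a NumPy array depending on the imbalanced-learn version.
#Checking the class balance of the training labels after SMOTE
print(pd.Series(np.ravel(y_train)).value_counts())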
The code below takes us through the logistic regression model. First, I instantiate the logistic regression estimator (with a higher maximum number of iterations due to convergence concerns) and define the hyperparameters and scoring metric for the grid search. I then use the tuned logistic regression model to predict the validation data and display the performance metrics below. I have also included performance metrics for several decision cutoffs between 0.3 and 0.75, giving decision makers additional flexibility with this particular model.
#Creating an estimator
logistic_model = LogisticRegression(random_state=1,max_iter=1000)
#Creating hyperparameter grid
C = np.logspace(-4, 1, num=5,base=10)
hyperparameters = dict(C=C)
#Creating an F-beta scorer (beta=2) that weights recall more heavily to reduce false negatives
alternate_scorer = make_scorer(fbeta_score, beta=2)
#Conducting the gridsearch with 3 fold cross validation
grid = GridSearchCV(logistic_model, hyperparameters, cv=3, verbose=0, scoring=alternate_scorer)
best_logistic_model = grid.fit(X_train, np.ravel(y_train))
#print('Best C:', best_logistic_model.best_estimator_.get_params()['C'])
#Getting class predictions from the tuned logistic regression model
y_scores_logistic = best_logistic_model.predict(X_val)
#Output confusion matrix and score report for logistic regression
print(pd.DataFrame(metrics.confusion_matrix(y_val, y_scores_logistic),\
columns=['pred_neg', 'pred_pos'],\
index=['neg', 'pos']))
print(metrics.classification_report(y_val, y_scores_logistic))
Below is the output for different decision thresholds. Using this output, the bank can decide what balance it wants to strike between calling too many people who will say no and minimizing losses from people who would have said yes had they been called.
y_logprobs = best_logistic_model.predict_proba(X_val)[:,1]
#Using threshold_optimization function (check Functions section) to generate report of probability cutoffs
#and their associated metrics
threshold_optimization(y_probas=y_logprobs, y_real=y_val,\
decision_thresholds = [x/100.0 for x in range(30,80,5)])
In the code below, I get the logistic regression coefficients from the model. These coefficients have the benefit of interpretability, unlike the other models in this notebook.
coefficients = pd.concat([pd.DataFrame(X_train.columns),\
pd.DataFrame(np.transpose(best_logistic_model.best_estimator_.coef_))],\
axis = 1)
coefficients.columns = ["Feature", "Coefficient"]
coefficients.head()
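Because the inputs were standardized, the coefficients are on comparable scales, and exponentiating them gives approximate odds ratios that can be easier to communicate. This is a small optional sketch building on the coefficients frame above.
#Converting the logistic regression coefficients to odds ratios for easier interpretation
coefficients["Odds Ratio"] = np.exp(coefficients["Coefficient"])
coefficients.sort_values("Odds Ratio", ascending=False).head()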
The code below takes us through the XGBoost model. Much like the logistic regression code, I define the model, the candidate hyperparameters, and the scoring method for the grid search, run the tuned model on the validation data, and output the performance metrics below. Note that the base_score values defined below are not included in the final hyperparameter grid; as with the logistic regression model, the predicted probabilities could be run through threshold_optimization if a different decision cutoff were needed.
xgboost_model = xgb.XGBClassifier(random_state=1)
#Creating hyperparameter grid
base_score = [x/10.0 for x in range(3,7,1)]
n_estimators = [10,100,1000]
max_depth = range(1,10,3)
learning_rate = [0.01, 0.1]
min_child_weight = [1,2]
hyperparameters = dict(n_estimators=n_estimators, max_depth = max_depth,\
learning_rate = learning_rate, min_child_weight = min_child_weight)
#Creating an F-beta scorer (beta=2) that weights recall more heavily to reduce false negatives
alternate_scorer = make_scorer(fbeta_score, beta=2)
#Creating grid object
grid = GridSearchCV(xgboost_model, hyperparameters, cv=3, verbose=0, scoring= alternate_scorer)
#Fitting grid object xgboost model with training data to select best model
best_xgboost_model = grid.fit(X_train, np.ravel(y_train))
#Getting class predictions from the tuned XGBoost model
y_scores_xgboost = best_xgboost_model.predict(X_val)
#Output confusion matrix and score report for xgboost
print(pd.DataFrame(metrics.confusion_matrix(y_val, y_scores_xgboost),\
columns=['pred_neg', 'pred_pos'],\
index=['neg', 'pos']))
print(metrics.classification_report(y_val, y_scores_xgboost))
The code below takes us through the random forest model. First I instantiate the classifier, then specify hyperparameters and conduct a gridsearch. I run the model on the validation data and output that performance below.
random_forest_model = RandomForestClassifier(random_state=1)
#Creating hyperparameter grid
n_estimators = [10,100,1000]
max_features = range(1,20,3)
max_depth = [10,100]
min_samples_split = [2,5,10]
min_samples_leaf = [1,2,4]
bootstrap = [True, False]
hyperparameters = dict(n_estimators=n_estimators, max_features = max_features,\
max_depth = max_depth, min_samples_split = min_samples_split,\
min_samples_leaf = min_samples_leaf, bootstrap = bootstrap)
#Creating an F-beta scorer (beta=2) that weights recall more heavily to reduce false negatives
alternate_scorer = make_scorer(fbeta_score, beta=2)
#Creating grid object
grid = GridSearchCV(random_forest_model, hyperparameters, cv=3, verbose=0,\
scoring= alternate_scorer)
#Fitting the random forest grid object with training data to select the best model
best_rf_model = grid.fit(X_train, np.ravel(y_train))
#Getting class predictions from the tuned random forest model
y_scores_rf = best_rf_model.predict(X_val)
print(pd.DataFrame(metrics.confusion_matrix(y_val, y_scores_rf),\
columns=['pred_neg', 'pred_pos'],\
index=['neg', 'pos']))
print(metrics.classification_report(y_val, y_scores_rf))
The code below takes us through the support vector machine model. Like the other models, I instantiate the classifier, specify hyperparameters, run a grid search, predict on the validation data, and display the results. SVMs are often favored for data with a very large number of features relative to the number of observations, but I thought I would give it a try just to see how it compares to the other models.
support_vector_machine_model = SVC(random_state=1, probability=True)
#Creating hyperparameter grid
C = np.logspace(-4, 1, num=5,base=10)
kernel = ["rbf", "poly"]
gamma = ["scale", "auto"]
hyperparameters = dict(C = C, kernel = kernel, gamma = gamma)
#Creating an F-beta scorer (beta=2) that weights recall more heavily to reduce false negatives
alternate_scorer = make_scorer(fbeta_score, beta=2)
#Creating grid object
grid = GridSearchCV(support_vector_machine_model, hyperparameters, cv=3, verbose=0,\
scoring= alternate_scorer)
#Fitting grid object svm model with training data to select best model
best_svm_model = grid.fit(X_train, np.ravel(y_train))
#Getting class predictions from the tuned support vector machine model
y_scores_svm = best_svm_model.predict(X_val)
print(pd.DataFrame(metrics.confusion_matrix(y_val, y_scores_svm),\
columns=['pred_neg', 'pred_pos'],\
index=['neg', 'pos']))
print(metrics.classification_report(y_val, y_scores_svm))
Below we compare and select the models based on the area under the Precision Recall Curve (PRC). I recommend using the AUC of the PRC for this dataset because of the class imbalance present in the target variable. We examine the performance on test data and training data. These comparisons will tell us a few things: which model is best for the problem, how well they will likely perform on unseen data, and how much the models overfit the training data.
#Logistic regression predictions and probabilities on test data and train data
test_probs_logistic = best_logistic_model.predict_proba(X_test)[:,1]
train_probs_logistic = best_logistic_model.predict_proba(X_train)[:,1]
#XG Boost predictions and probabilities on test data and train data
test_probs_xgboost = best_xgboost_model.predict_proba(X_test)[:,1]
train_probs_xgboost = best_xgboost_model.predict_proba(X_train)[:,1]
#Random Forest predictions and probabilities on test data and train data
test_probs_rf = best_rf_model.predict_proba(X_test)[:,1]
train_probs_rf = best_rf_model.predict_proba(X_train)[:,1]
#Support Vector Machine predictions and probabilities on test data and train data
test_probs_svm = best_svm_model.predict_proba(X_test)[:,1]
train_probs_svm = best_svm_model.predict_proba(X_train)[:,1]
#Getting the precision and recall for each model on the test data
prec_lr, rec_lr, _= metrics.precision_recall_curve(y_true=y_test,\
probas_pred= test_probs_logistic)
prec_xgb, rec_xgb, _= metrics.precision_recall_curve(y_true=y_test,\
probas_pred=test_probs_xgboost)
prec_rf, rec_rf, _= metrics.precision_recall_curve(y_true=y_test,\
probas_pred= test_probs_rf)
prec_svm, rec_svm, _= metrics.precision_recall_curve(y_true=y_test,\
probas_pred=test_probs_svm)
#Getting the precision and recall for each model on the train data
prec_lr_tr, rec_lr_tr, _= metrics.precision_recall_curve(y_true=y_train,\
probas_pred= train_probs_logistic)
prec_xgb_tr, rec_xgb_tr, _= metrics.precision_recall_curve(y_true=y_train,\
probas_pred=train_probs_xgboost)
prec_rf_tr, rec_rf_tr, _= metrics.precision_recall_curve(y_true=y_train,\
probas_pred= train_probs_rf)
prec_svm_tr, rec_svm_tr, _= metrics.precision_recall_curve(y_true=y_train,\
probas_pred=train_probs_svm)
The plot below shows each model's precision-recall curve on the test data. Based on its area under the curve of 0.355, the XGBoost model is the best of the four.
_=plt.plot(rec_lr, prec_lr, color='C0')
_=plt.plot(rec_xgb, prec_xgb, color= 'C1')
_=plt.plot(rec_rf, prec_rf, color = 'C2')
_=plt.plot(rec_svm, prec_svm, color = 'C3')
#Tweaking some visualization characteristics (axis limits, labels, title)
_= plt.xlim([-0.01, 1.0])
_= plt.ylim([0.0, 1.05])
#_= plt.plot([0, 1], [0, 1], 'k--')
_= plt.xlabel('Recall')
_= plt.ylabel('Precision')
_= plt.title('Precision Recall Curve: Test Data')
#This is all legend formatting
blue_patch = mpatches.Patch(color='C0',\
label=('Logistic Regression: '\
+str(round(metrics.auc(rec_lr,prec_lr),ndigits=3))))
orange_patch = mpatches.Patch(color='C1',\
label=('XG Boost: '\
+str(round(metrics.auc(rec_xgb,prec_xgb),ndigits=3))))
green_patch = mpatches.Patch(color='C2',\
label=('Random Forest: '\
+str(round(metrics.auc(rec_rf,prec_rf),ndigits=3))))
red_patch = mpatches.Patch(color='C3',\
label=('Support Vector Machine: '\
+str(round(metrics.auc(rec_svm,prec_svm,),ndigits=3))))
plt.legend(handles=[blue_patch, orange_patch,green_patch, red_patch])
#Changing the figure size from the default
plt.rcParams["figure.figsize"] = [8,10]
plt.show()
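As a cross-check on the legend values above, average precision is a closely related single-number summary of the same curves; the sketch below computes it for each model on the test probabilities calculated earlier.
#Average precision on the test data as a cross-check of the PRC AUC values in the legend
for name, probs in [("Logistic Regression", test_probs_logistic),
                    ("XG Boost", test_probs_xgboost),
                    ("Random Forest", test_probs_rf),
                    ("Support Vector Machine", test_probs_svm)]:
    print(name + ":", round(metrics.average_precision_score(np.ravel(y_test), probs), 3))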
The plot below shows the performance of each model on the training data. Based on the disparity between the test and training PRC performance, it is evident that the models overfit the training data and did not generalize as well to the test set. Part of the gap also reflects that the training set was balanced with SMOTE, which raises the baseline precision relative to the imbalanced test set.
_=plt.plot(rec_lr_tr, prec_lr_tr, color='C0')
_=plt.plot(rec_xgb_tr, prec_xgb_tr, color= 'C1')
_=plt.plot(rec_rf_tr, prec_rf_tr, color = 'C2')
_=plt.plot(rec_svm_tr, prec_svm_tr, color = 'C3')
#Tweaking some visualization characteristics (axis limits, labels, title)
_= plt.xlim([-0.01, 1.01])
_= plt.ylim([0.0, 1.05])
#_= plt.plot([0, 1], [0, 1], 'k--')
_= plt.xlabel('Recall')
_= plt.ylabel('Precision')
_= plt.title('Precision Recall Curve: Training Data')
#This is all legend formatting
blue_patch = mpatches.Patch(color='C0',\
label=('Logistic Regression: '\
+str(round(metrics.auc(rec_lr_tr,prec_lr_tr),ndigits=3))))
orange_patch = mpatches.Patch(color='C1',\
label=('XG Boost: '\
+str(round(metrics.auc(rec_xgb_tr,prec_xgb_tr),ndigits=3))))
green_patch = mpatches.Patch(color='C2',\
label=('Random Forest: '\
+str(round(metrics.auc(rec_rf_tr,prec_rf_tr),ndigits=3))))
red_patch = mpatches.Patch(color='C3',\
label=('Support Vector Machine: '\
+str(round(metrics.auc(rec_svm_tr,prec_svm_tr),ndigits=3))))
plt.legend(handles=[blue_patch, orange_patch,green_patch, red_patch])
#Changing the figure size from the default
plt.rcParams["figure.figsize"] = [8,10]
plt.show()
The table below shows the top five feature importances for the XGBoost model and their corresponding logistic regression coefficients for some added interpretability.
#Creating a dataframe containing the features and their XGBoost feature importances
fi = pd.concat([pd.DataFrame(X_train.columns),\
pd.DataFrame(np.transpose(best_xgboost_model.best_estimator_.feature_importances_))],\
axis = 1)
fi.columns = ["Feature","XGB Importance"]
#Creating a dataframe with a new column containing the logistic regression coefficient
result = pd.merge(fi,
coefficients["Coefficient"],
on=fi.Feature,
how='inner')
result = result.drop("key_0", axis = 1)
most_important_xgb = result.sort_values("XGB Importance", ascending = False)
most_important_xgb.head()
Suggestions for future modeling improvements:
1) Reconsider sampling strategy. SMOTE oversampling may have introduced additional noise that the models are overfitting to. Undersampling may be a viable alternative to try and reduce overfitting.
2) Reconsider hyperparameters, especially the depth parameters for XGBoost and Random Forest and the regularization strength for Logistic Regression and SVM. Reducing the maximum depth or the C parameter (the inverse regularization strength) would introduce bias and reduce the variance of the models, thereby reducing overfitting.
3) Implement feature selection/feature engineering. Working with domain experts to categorize features into related business factors would potentially eliminate excessive variables from the modeling process. Additionally, input from those stakeholders could inform new features that better concentrate the signal from the data.
Here are the final recommendations to the bank from this analysis:
1) The bank should inform its employees to keep the conversation going for as long as possible. While it could not be used in the modeling process, the impact of duration cannot be ignored. Clients who subscribed to the term deposit generally stayed on the phone longer with bank employees.
2) Use the XGBoost model to predict client response and, in tandem with the logistic regression coefficients, identify the factors that matter most to guide strategy. Here is an example using the top variable (nr.employed) and its corresponding logistic regression coefficient:
The XGBoost model found nr.employed to be the most important feature in determining whether or not a client will subscribe to a term deposit. Looking at the corresponding logistic regression coefficient, a higher value of the employment indicator (i.e., a stronger overall labor market) is associated with a lower chance that a client will subscribe to the term deposit through direct marketing, holding all else equal. In other words, the campaign tends to convert better under weaker macroeconomic conditions, so the bank should factor the economic climate into when and how aggressively it runs this campaign.