Bank Direct Marketing System using Machine Learning

Aditya Kumar Pandey
14 min read · Jun 14, 2022

Direct marketing is one of the most effective ways to target an audience directly and to measure the results. Before jumping into this case study, let’s first understand the idea behind direct marketing.

Direct marketing is a technique for contacting customers directly through e-mail, messages, calls, and similar channels, here with the goal of getting the customer to subscribe to a term deposit. One of its main advantages is that you can reach a specific audience with personalized messages or calls. Good direct marketing therefore starts with taking the time to research which target audience is most likely to convert.

Problem statement and Dataset

The aim of this project is to predict whether a customer will subscribe to the bank’s term deposit. The dataset is publicly available for research on the UCI Machine Learning Repository, and its details are described in [Moro et al., 2011]. It relates to the direct marketing campaigns of a Portuguese banking institution. Since the full dataset is large, I have used a sample with 4521 rows and 17 columns.

The details about the features are given below:

  1. age (numeric)
  2. job: type of job (categorical: “admin.”, “unknown”, “unemployed”, “management”, “housemaid”, “entrepreneur”, “student”,
    “blue-collar”, “self-employed”, “retired”, “technician”, “services”)
  3. marital: marital status (categorical: “married”, “divorced”, “single”; note: “divorced” means divorced or widowed)
  4. education (categorical: “unknown”, “secondary”, “primary”, “tertiary”)
  5. default: has credit in default? (binary: “yes”, “no”)
  6. balance: average yearly balance, in euros (numeric)
  7. housing: has a housing loan? (binary: “yes”, “no”)
  8. loan: has a personal loan? (binary: “yes”, “no”)
    # related with the last contact of the current campaign:
  9. contact: contact communication type (categorical: “unknown”, “telephone”, “cellular”)
  10. day: last contact day of the month (numeric)
  11. month: last contact month of the year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
  12. duration: last contact duration, in seconds (numeric)
  13. campaign: number of contacts performed during this campaign and for this client (numeric, includes the last contact)
  14. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
  15. previous: number of contacts performed before this campaign and for this client (numeric)
  16. poutcome: outcome of the previous marketing campaign (categorical: “unknown”, “other”, “failure”, “success”)

Output variable (desired target):

  17. y: has the client subscribed to a term deposit? (binary: “yes”, “no”)

Loading the dataset

The first step is to load the dataset and print its shape. Data has 4521 rows and 17 columns.

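# import the libraries used throughout this post
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# load data set
df = pd.read_csv("bank.csv")
# display the first 5 rows of dataset
print(df.head())
# print the shape of the dataset (rows, columns)
print(df.shape)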

Exploratory Data Analysis (EDA)

EDA is the process of analyzing and investigating the data to gain insights and dig deeper into the dataset. It helps in discovering outliers and meaningful patterns in the data.

I have performed EDA on the data to get more useful information like finding the description, checking for null values, etc.

The next step is to show the statistical description of the data like mean value, maximum and minimum values of the numerical features, etc., and also find null values and impute them if they exist.

# print the description of the dataset
print(df.describe())
# count the null values in each column
print(df.isnull().sum())

The above results show the description and number of null values present in the data. It is very clear from the above result that the dataset has no null values present in it.

Data Visualization

Next, I have plotted the visualization for various features in the dataset. It helps to get more visual insights from the data.

I have plotted distribution plots for the features ‘balance’, ‘age’, and ‘duration’ to visualize their distributions.

# analysing balance feature
# (histplot replaces distplot, which is deprecated in recent seaborn versions)
sns.histplot(df['balance'], kde=True, stat='density')
plt.title('Distribution of balance')
plt.show()
# print skewness and kurtosis value of the data
# kurtosis indicates whether the data is heavy-tailed; a high kurtosis value
# suggests heavy tails, i.e. the data is likely to contain outliers.
print("skewness: %f" % df['balance'].skew())
print("Kurtosis: %f" % df['balance'].kurt())

You can see the distribution of the balance feature. It is highly right-skewed (positively skewed): most of the distribution lies between 0 and about 10k, with the long tail on the right-hand side.

The kurtosis value shows that the feature is heavy-tailed and contains a large number of outliers. Kurtosis measures how heavy the tails of a distribution are.
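To make the sign conventions concrete, here is a minimal sketch on synthetic data (the exponential sample and the variable names are assumptions for illustration, not part of the bank dataset). A long tail on the right gives positive skewness, a long tail on the left gives negative skewness, and pandas reports excess kurtosis, which is 0 for a normal distribution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: an exponential sample has a long tail on the right
right_tailed = pd.Series(rng.exponential(scale=1000, size=5000))
# Mirroring the sample moves the long tail to the left
left_tailed = -right_tailed

print("right tail skewness: %f" % right_tailed.skew())  # positive
print("left tail skewness: %f" % left_tailed.skew())    # negative
print("excess kurtosis: %f" % right_tailed.kurt())      # above 0 for heavy tails
```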

The next analysis was done for the age feature. The distribution plot for age feature shows the below results.

# distribution of age
sns.histplot(df['age'], kde=True, stat='density')  # distplot is deprecated in recent seaborn
plt.title("distribution of age")
plt.show()
# print skewness and kurtosis value of the data
print("skewness: %f" % df['age'].skew())
print("Kurtosis: %f" % df['age'].kurt())

Observations:

  1. The age feature is only mildly skewed, although it is not normally distributed either.
  2. It may contain some outliers.
  3. The ages are concentrated around 35 to 40.
  4. The kurtosis value is also low, which shows that the feature is not heavy-tailed and has few outliers.

The next plot shows the distribution of the duration feature, which records the last contact duration in seconds.

# distribution of duration
sns.histplot(df['duration'], kde=True, stat='density')  # distplot is deprecated in recent seaborn
plt.title("distribution of duration")
plt.show()

Observations:

  1. It is right-skewed.
  2. The maximum contact duration is more than 3000 seconds.
  3. The minimum duration is around 1 second.

I have also drawn a scatter plot of the balance feature, coloured by the target feature y, which indicates whether the customer has subscribed or not.

# plot scatter plot for balance data
sns.scatterplot(x = df.index, y = df['balance'], hue = df['y'])
plt.title("Scatter plot for distribution of balance")
plt.show()

Observations:

  1. The average balance amount is around 2000 or less.
  2. Customers who did not subscribe to the deposit outnumber those who did.
  3. There are also some outliers.

The next plots are count plots for some of the categorical variables, showing the count of each category in a particular feature. The chart below shows the counts for the marital status of the customers.

# count plot for marital data
plt.figure(figsize = (10,6))
sns.countplot(x = 'marital', data = df)  # pass the column by keyword
plt.xlabel("Marital Status")
plt.ylabel("No. of counts")
plt.title("Count of different marital status")
plt.show()

Married customers are the most numerous, followed by single and then divorced customers.

The next plot shows the count for the job category on the basis of subscriptions made or not.

# plot for job category
plt.figure(figsize = (15,6))
sns.countplot(x = 'job', hue = 'y', data = df)  # pass the columns by keyword
plt.xlabel("Job Category")
plt.ylabel("No. of counts")
plt.title("Count of different job category")
plt.show()

Customers in management, blue-collar, and technician jobs have slightly more subscriptions than the others.

There is a feature called campaign which records the data about the number of calls made to the customers. I have plotted the distribution plot for that and the following observations were made.

# plot for campaign category
plt.figure(figsize = (15,6))
sns.histplot(df['campaign'], kde = False)  # distplot is deprecated in recent seaborn
plt.xlabel("Campaign Category")
plt.ylabel("No. of counts")
plt.title("Count of Campaign made")
plt.show()

Observations:

  1. Most customers were called only 1 or 2 times.
  2. Customers who were called more than 10 times may be outliers.
  3. The average number of calls is around 2.

Now let’s take a look at our target variable. I will draw a count plot for the target feature and see what it shows: the number of customers who subscribed (yes) and who did not (no).

# plot for target variable
plt.figure(figsize = (10,5))
sns.countplot(x = 'y', data = df)  # pass the column by keyword
plt.xlabel("Subscription")
plt.ylabel("No. of counts")
plt.title("Count of target variable that is subscription")
plt.show()

Observations:

  1. The number of subscriptions made is very low.
  2. This shows that the data is highly imbalanced.
  3. The data needs to be balanced for better results.
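
The imbalance can also be expressed as numbers rather than read off the chart. The following is a minimal sketch on a made-up target column (`y_toy` and its 90/10 split are assumptions for illustration, not the actual class counts of the bank data).

```python
import pandas as pd

# Hypothetical target column: 9 "no" for every "yes"
y_toy = pd.Series(["no"] * 90 + ["yes"] * 10, name="y")

counts = y_toy.value_counts()                  # absolute count per class
shares = y_toy.value_counts(normalize=True)    # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()  # majority size / minority size

print(counts)
print(shares)
print("imbalance ratio:", imbalance_ratio)
```

On the real data, the same calls on df['y'] would give the actual class shares.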

Removing Outliers

Outliers are data points that are extremely high or extremely low relative to the rest of the data. They are often treated as errors in the data and may affect the model. A box plot is one of the best plots for spotting outliers.

I have plotted box plots for the age, duration, and campaign features and got the following results.

# boxplot for age
sns.boxplot(x = df['age'])
plt.title("Boxplot for Age Feature")
plt.xlabel("Age")
plt.show()
# box plot for duration
sns.boxplot(x = df['duration'])
plt.title("Boxplot for duration Feature")
plt.xlabel("Duration")
plt.show()
# box plot for campaign
sns.boxplot(x = df['campaign'])
plt.title("Boxplot for campaign Feature")
plt.xlabel("Campaign")
plt.show()

It can be seen from the plots that there is a large number of outliers in these features. I will use the IQR technique to remove these outliers.

Removing outliers from features

I have used IQR for handling the outliers. For calculating IQR:

  1. Calculate the first quartile and the third quartile, i.e. Q1 and Q3.
  2. Then, calculate the IQR as Q3 - Q1.
  3. Compute the lower bound (Q1 - 1.5*IQR) and the upper bound (Q3 + 1.5*IQR).
  4. Remove the points that lie outside the lower and upper bounds.

Here I have calculated the IQR, upper bound, and lower bound for the balance, age, campaign, and duration features and then removed the values that lie outside these bounds.

Removing outliers from balance

Removing outliers from age

Removing outliers from duration

Removing outliers from campaign
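
The four IQR steps listed above can be sketched as a small helper. This is a minimal illustration under assumptions: the function name `remove_outliers_iqr`, its parameter `k`, and the toy data are hypothetical, and the original notebook may have implemented the filtering differently.

```python
import pandas as pd

def remove_outliers_iqr(frame, column, k=1.5):
    """Keep the rows whose `column` value lies within
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)   # first quartile
    q3 = frame[column].quantile(0.75)   # third quartile
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return frame[frame[column].between(lower, upper)]

# Toy data: 500 lies far outside the bulk of the values
toy = pd.DataFrame({"balance": [10, 12, 14, 15, 18, 20, 500]})
cleaned = remove_outliers_iqr(toy, "balance")
print(len(toy), "->", len(cleaned))
```

Under the same assumptions, the helper could be applied in turn to balance, age, duration, and campaign, reassigning the filtered frame to df after each call.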

Now that I have removed the outliers from the data, I will plot the box plots again to check whether any outliers are still present.

# box plot after removing outliers
sns.boxplot(x = df['duration'])
plt.title("Box plot of duration after removing outliers")
plt.xlabel('DURATION')
plt.show()
# box plot after removing outliers
sns.boxplot(x = df['balance'])
plt.title("Box plot of balance after removing outliers")
plt.xlabel('BALANCE')
plt.show()
# box plot after removing outliers
sns.boxplot(x = df['age'])
plt.title("Box plot of age after removing outliers")
plt.xlabel('AGE')
plt.show()
# box plot after removing outliers
sns.boxplot(x = df['campaign'])
plt.title("Box plot of campaign after removing outliers")
plt.xlabel('CAMPAIGN')
plt.show()

From the plots, it is clear that most of the outliers have been removed; a few still remain, but they can be ignored. I will also check the skewness of the data after removing outliers. Let’s see what we got after the outlier treatment.

# analysing balance feature (histplot replaces the deprecated distplot)
sns.histplot(df['balance'], kde=True, stat='density')
plt.title('Distribution of balance after removing outliers')
plt.xlabel("Balance")
plt.show()
# analysing duration feature
sns.histplot(df['duration'], kde=True, stat='density')
plt.title('Distribution of duration after removing outliers')
plt.xlabel("Duration")
plt.show()

We can see that the balance and duration features were highly right-skewed before, but now the skew has improved considerably. There is a huge difference between the plots with and without outliers.

Now that I am done with handling the outliers, it is time for some more data preprocessing. Some of the features in the dataset contain unknown values. I will treat these unknown values as null values and replace them with the most frequent value in each feature.

Let’s take a look at the values or categories in these features.

# print count of each value in particular feature columns
print(df['poutcome'].value_counts())
print("***********\n")
print(df['contact'].value_counts())
print("***********\n")
print(df['education'].value_counts())
print("***********\n")
print(df['job'].value_counts())

The above features contain unknown values, so let’s impute them.

# Here I have considered all unknown values as null values.
# Replacing all unknown values with the value that occurs most often.
# The dataset is small and the count of unknown values is large in each feature mentioned above,
# so we cannot drop these rows without shrinking the data too much.
df['poutcome'] = df['poutcome'].replace(['unknown'], 'failure')
df['contact'] = df['contact'].replace(['unknown'], 'cellular')
df['education'] = df['education'].replace(['unknown'], 'secondary')
df['job'] = df['job'].replace(['unknown'], 'management')
# print the count of values after filling the unknown values.
print(df['poutcome'].value_counts())
print("***********\n")
print(df['contact'].value_counts())
print("***********\n")
print(df['education'].value_counts())
print("***********\n")
print(df['job'].value_counts())
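
Instead of hard-coding the replacement for each column, the most frequent known value can be looked up programmatically. The sketch below is an illustration only: the helper `replace_unknown_with_mode` and the toy column are hypothetical. It excludes the placeholder before taking the mode, because "unknown" can itself be the most frequent label in a column.

```python
import pandas as pd

def replace_unknown_with_mode(series, placeholder="unknown"):
    """Replace `placeholder` with the most frequent *other* value."""
    known = series[series != placeholder]
    most_common = known.mode().iloc[0]  # on ties, mode() is sorted, so take the first
    return series.replace(placeholder, most_common)

# Toy column in which "unknown" is itself the most frequent label
toy = pd.Series(["unknown", "unknown", "unknown", "failure", "failure", "success"])
filled = replace_unknown_with_mode(toy)
print(filled.value_counts())
```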

So, finally, I have replaced the unknown values in these features. After performing all the above steps, the dataset is left with 4048 rows.

Now it’s time to check for correlated variables. For that, I will plot a heatmap and check whether any features are correlated. If two features have a correlation of more than 95%, I will drop one of them.

Note: Dropping features is not always recommended. If the number of features is small, we can avoid dropping any.

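# work on a copy of the cleaned data; `data` is used in the remaining steps
data = df.copy()
# the heatmap helps to find which pairs of variables are correlated.
plt.figure(figsize=(15, 10))
# numeric_only=True skips the text columns, which have no correlation defined
sns.heatmap(data.corr(numeric_only=True), annot=True)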

Since no pair of variables is strongly correlated, we are good to go.
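
The 95% rule can also be checked in code rather than by reading the heatmap. Here is a minimal sketch, assuming a hypothetical helper `highly_correlated_pairs` and a toy frame; it only lists candidate pairs and leaves the decision to drop a column to the reader.

```python
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.95):
    """List (feature_a, feature_b, corr) for numeric column pairs whose
    absolute Pearson correlation exceeds `threshold`."""
    corr = frame.select_dtypes(include="number").corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, no self-pairs
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs

# Toy data: "b" is an exact multiple of "a"; "c" is only weakly correlated with both
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                    "b": [2, 4, 6, 8, 10],
                    "c": [5, 1, 4, 2, 3]})
pairs = highly_correlated_pairs(toy)
print(pairs)
```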

Encoding Categorical Variable

An important step is to encode the categorical variables. Since machine learning models do not understand text data, I will encode them as numerical values, mapping the binary and month features to integers and one-hot encoding the rest.

# encode the binary yes/no features (and the target) as 1/0.
# assigning the result back avoids relying on inplace edits of a column.
for col in ['default', 'housing', 'loan', 'y']:
    data[col] = data[col].map({'yes': 1, 'no': 0})
data['contact'] = data['contact'].map({'cellular': 1, 'telephone': 0})
# encode the month as its calendar number
month_map = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6, 'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}
data['month'] = data['month'].map(month_map)
# one hot encoding for other categorical features.
data1 = pd.get_dummies(data = data, columns = ['poutcome', 'education', 'marital', 'job'])
data1.head()

Scaling The data

I have scaled the values using a standard scaler. It rescales each feature so that its mean is 0 and its standard deviation is 1. I have scaled the independent numerical features age, balance, duration, and pdays.

# use a standard scaler to scale the values.
from sklearn.preprocessing import StandardScaler
scaled_col = ['age', 'balance', 'duration', 'pdays']
scaler = StandardScaler()
data1[scaled_col] = scaler.fit_transform(data1[scaled_col])
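
What the scaler computes can be written out by hand: each value has the column mean subtracted and is then divided by the column standard deviation. The sketch below uses made-up ages (an assumption for illustration) and the population standard deviation, which is NumPy's default and also what StandardScaler uses.

```python
import numpy as np

# Hypothetical ages; any numeric column works the same way
ages = np.array([25.0, 35.0, 45.0, 55.0])

mean = ages.mean()
std = ages.std()               # population standard deviation (ddof=0)
scaled = (ages - mean) / std   # z = (x - mean) / std

print(scaled)
print("mean:", scaled.mean(), "std:", scaled.std())
```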

Handle imbalanced data

After all these preprocessing techniques, one important step is still left: handling the imbalanced data. Imbalanced data refers to the condition where the target variable has an uneven distribution of observations. I will use the SMOTETomek technique (SMOTE oversampling combined with Tomek-link cleaning) to handle it. Before I proceed to this step, I will separate the target variable from the independent variables and also split the data into a train set and a test set.

from sklearn.model_selection import train_test_split
x = data1.drop(['y'], axis = 1) # independent features
y = data1['y'] # target feature
# split the data into train and test (75 : 25)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 42)
# print shape of train and test data
print("shape of x_train :", x_train.shape)
print("shape of x_test :", x_test.shape)

Now, I will use SMOTETomek to balance the training data.

#!pip install imbalanced-learn
from imblearn.combine import SMOTETomek
# sampling_strategy has to be passed by keyword in recent imbalanced-learn versions
smk = SMOTETomek(sampling_strategy=0.68)
X_res, y_res = smk.fit_resample(x_train, y_train)
print(X_res.shape, y_res.shape)
# Count the number of samples in each class
from collections import Counter
print("Class counts before resampling {}".format(Counter(y_train)))
print("Class counts after resampling {}".format(Counter(y_res)))

You can see the difference between the classes before and after balancing the data. Let’s see this through visualization.

plt.figure(figsize = (10,5))
sns.countplot(x = y_res)  # y_res already holds the resampled labels
plt.xticks(fontsize=12, rotation=0)
plt.yticks(fontsize=12)
plt.show()

Now the class labels are somewhat balanced, so the data is ready for model building.
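
For intuition about the SMOTE half of SMOTETomek: new minority samples are created by interpolating between an existing minority point and one of its nearest minority neighbours. The sketch below is a simplified illustration of that interpolation step only; the function `smote_like_samples` and the toy points are hypothetical, and this is not the imbalanced-learn implementation. The Tomek half, which afterwards removes pairs of opposite-class points that are each other's nearest neighbours, is not shown.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and one of their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # pairwise Euclidean distances between the minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest points
    base = rng.integers(0, len(X_min), size=n_new)  # random base points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))                    # interpolation weights in [0, 1)
    return X_min[base] + lam * (X_min[picked] - X_min[base])

# Toy minority class with two features
minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0], [2.5, 2.0]])
synthetic = smote_like_samples(minority, n_new=4)
print(synthetic.shape)
```

Because every synthetic point lies on a line segment between two real minority points, the new samples stay inside the region the minority class already occupies.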

Model building

The final step is building the machine learning model. I will try different algorithms and check which one gives better accuracy, precision, and recall scores.

Since it is a classification problem, I will use the confusion matrix and the classification report to check the performance of each model.
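
Precision and recall matter here because accuracy alone can look good on imbalanced data. As a minimal sketch with made-up counts (the helper `classification_scores` and the 900/100 split are assumptions for illustration), consider a model that always predicts "no":

```python
def classification_scores(tp, fp, fn, tn):
    """Accuracy, precision and recall for the positive class,
    computed from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # no positive predictions -> 0
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # no actual positives -> 0
    return accuracy, precision, recall

# Hypothetical test set: 900 "no" and 100 "yes" customers.
# A model that always predicts "no" misses every subscriber.
always_no = classification_scores(tp=0, fp=0, fn=100, tn=900)
print(always_no)  # (0.9, 0.0, 0.0)
```

The accuracy is 90% even though not a single subscriber is found, which is why the per-class precision and recall are worth comparing across models.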

Let’s try the algorithms one by one and see which one gives better results.

Logistic Regression

# training model with Logistic regression
from sklearn.linear_model import LogisticRegression
lg = LogisticRegression()
lg.fit(X_res, y_res)
Lg_pred = lg.predict(x_test)
# Get the accuracy and classification report
from sklearn import metrics
print("Accuracy is : ", metrics.accuracy_score(y_test, Lg_pred))
print(metrics.confusion_matrix(y_test, Lg_pred))
print(metrics.classification_report(y_test, Lg_pred))

Random Forest

# Build model with Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_res, y_res)
rf_pred = rf.predict(x_test)
print("Accuracy is : ", metrics.accuracy_score(y_test, rf_pred))
print(metrics.confusion_matrix(y_test, rf_pred))
print(metrics.classification_report(y_test, rf_pred))

SVM (Support Vector Machine)

# SVC with an RBF kernel
from sklearn.svm import SVC
lr_svc = SVC(kernel = 'rbf')
lr_svc.fit(X_res, y_res)
sv_pred = lr_svc.predict(x_test)
print("Accuracy score is : ", metrics.accuracy_score(y_test, sv_pred))
print(metrics.confusion_matrix(y_test, sv_pred))
print(metrics.classification_report(y_test, sv_pred))

XGBoost

# xgboost
from xgboost import XGBClassifier
xg = XGBClassifier()
xg.fit(X_res, y_res)
xg_pred = xg.predict(x_test)
print("Accuracy is : ", metrics.accuracy_score(y_test, xg_pred))
print(metrics.confusion_matrix(y_test, xg_pred))
print(metrics.classification_report(y_test, xg_pred))

KNN

# build model with KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_res, y_res)
knn_pred = knn.predict(x_test)
# print the accuracy and classification report
print("Accuracy is : ", metrics.accuracy_score(y_test, knn_pred))
print(metrics.confusion_matrix(y_test, knn_pred))
print(metrics.classification_report(y_test, knn_pred))

Decision Tree

# Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_res, y_res)
dt_pred = dt.predict(x_test)
# print the accuracy and classification report
print("Accuracy is : ", metrics.accuracy_score(y_test, dt_pred))
print(metrics.confusion_matrix(y_test, dt_pred))
print(metrics.classification_report(y_test, dt_pred))

Observations:

  1. Among all the algorithms used above, XGBoost gives slightly better results.
  2. The precision and recall scores of XGBoost are slightly better than those of the other models.
