A bank, while reviewing its customer base, found that its liability customers (depositors) have grown significantly in comparison to its asset customers (borrowers).
The bank now wants to aggressively increase its asset customers by offering loans against their credit cards. This will not only balance the two categories of its customer base but also help it earn interest at a better margin.
The bank had already run a campaign to offer these loans, but it was not satisfied because the success rate was in single digits. This time it wants significantly better performance without increasing the campaign budget.
It has now hired a data science company, Analytics Educator, to guide it towards these goals without increasing costs. Analytics Educator will use different machine learning algorithms to solve this problem.
This is a very common problem for financial institutions, so we have taken up this case study to show our readers how a real-life project is done in the corporate world.
In this FREE case study, Analytics Educator will show readers how this real-life business problem can be solved. We will follow a step-by-step approach using machine learning algorithms.
• DATA DESCRIPTION: The data consists of the following attributes:
ID: Customer ID.
Age: Customer’s approximate age.
CustomerSince: Customer of the bank since.
HighestSpend: Customer’s highest spend so far in one transaction.
ZipCode: Customer’s zip code.
HiddenScore: A score associated to the customer which is masked by the bank as an IP.
MonthlyAverageSpend: Customer’s monthly average spend so far.
Level: A level associated to the customer which is masked by the bank as an IP.
Mortgage: Customer’s mortgage.
Security: Customer’s security asset with the bank.
FixedDepositAccount: Customer’s fixed deposit account with the bank.
InternetBanking: if the customer uses internet banking.
CreditCard: if the customer uses bank’s credit card.
LoanOnCard: if the customer has a loan on credit card.
Here LoanOnCard is the variable we will predict (our dependent variable). The rest of the variables are our independent variables.
We will predict the dependent variable using different machine learning algorithms such as Logistic Regression, Decision Tree and Random Forest. Once done, we will compare their results to zero in on the best algorithm.
By using machine learning algorithms, we aim to improve the accuracy of our campaign significantly beyond the baseline (the accuracy we currently have).
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
pd.options.mode.chained_assignment = None # suppresses pandas' SettingWithCopyWarning
os.chdir("C:\\Users\\ASUS")
pd.set_option('display.max_columns', None)
df = pd.read_csv("full_data.csv")
# Preview our data
df.tail()
| | Unnamed: 0 | ID | Age | CustomerSince | HighestSpend | ZipCode | HiddenScore | MonthlyAverageSpend | Level | Mortgage | Security | FixedDepositAccount | InternetBanking | CreditCard | LoanOnCard |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4975 | 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 1 | 0 | 0.0 |
4976 | 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 1 | 0 | 0.0 |
4977 | 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0.0 |
4978 | 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 1 | 0 | 0.0 |
4979 | 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 1 | 1 | 0.0 |
# frequency dist for all variables
for column in df.columns:
    print("\n" + column)
    print(df[column].value_counts())
Unnamed: 0
2047: 1, 4659: 1, 553: 1, 4651: 1, 2604: 1, .., 1222: 1, 3271: 1, 1226: 1, 3275: 1, 2049: 1
Name: Unnamed: 0, Length: 4980, dtype: int64

ID
2047: 1, 2612: 1, 4651: 1, 2604: 1, 557: 1, .., 3271: 1, 1226: 1, 3275: 1, 1230: 1, 2049: 1
Name: ID, Length: 4980, dtype: int64

Age
43: 149, 35: 148, 52: 145, 54: 143, 58: 143, 30: 136, 50: 136, 56: 135, 41: 135, 34: 134, 39: 132, 59: 132, 57: 132, 51: 129, 60: 126, 45: 126, 46: 126, 42: 126, 31: 125, 55: 125, 40: 124, 62: 123, 29: 123, 61: 122, 44: 121, 32: 120, 33: 119, 48: 117, 38: 115, 49: 115, 47: 112, 53: 111, 63: 108, 36: 107, 37: 105, 28: 103, 27: 90, 65: 79, 64: 78, 26: 78, 25: 51, 24: 28, 66: 24, 23: 12, 67: 12
Name: Age, dtype: int64

CustomerSince
32: 154, 20: 148, 5: 146, 9: 145, 23: 144, 35: 143, 25: 142, 28: 138, 18: 137, 19: 134, 26: 133, 24: 130, 3: 129, 14: 127, 30: 126, 34: 125, 17: 125, 16: 125, 29: 124, 27: 124, 7: 121, 22: 121, 6: 119, 15: 118, 8: 118, 33: 117, 10: 117, 37: 116, 13: 116, 11: 116, 4: 113, 36: 113, 21: 113, 31: 104, 12: 102, 38: 88, 39: 85, 2: 84, 1: 73, 0: 66, 40: 57, 41: 42, -1: 32, -2: 15, 42: 8, -3: 4, 43: 3
Name: CustomerSince, dtype: int64

HighestSpend
44: 85, 38: 84, 81: 82, 41: 81, 39: 81, .., 189: 2, 202: 2, 205: 2, 224: 1, 218: 1
Name: HighestSpend, Length: 162, dtype: int64

ZipCode
94720: 167, 94305: 125, 95616: 116, 90095: 71, 93106: 57, ..., 96145: 1, 94970: 1, 94598: 1, 90068: 1, 94087: 1
Name: ZipCode, Length: 467, dtype: int64

HiddenScore
1: 1466, 2: 1293, 4: 1215, 3: 1006
Name: HiddenScore, dtype: int64

MonthlyAverageSpend
0.30: 240, 1.00: 229, 0.20: 204, 2.00: 188, 0.80: 187, ..., 3.25: 1, 8.20: 1, 9.30: 1, 8.90: 1, 5.33: 1
Name: MonthlyAverageSpend, Length: 108, dtype: int64

Level
1: 2089, 3: 1496, 2: 1395
Name: Level, dtype: int64

Mortgage
0: 3447, 98: 17, 91: 16, 83: 16, 89: 16, ..., 541: 1, 509: 1, 505: 1, 485: 1, 577: 1
Name: Mortgage, Length: 347, dtype: int64

Security
0: 4460, 1: 520
Name: Security, dtype: int64

FixedDepositAccount
0: 4678, 1: 302
Name: FixedDepositAccount, dtype: int64

InternetBanking
1: 2974, 0: 2006
Name: InternetBanking, dtype: int64

CreditCard
0: 3514, 1: 1466
Name: CreditCard, dtype: int64

LoanOnCard
0.0: 4500, 1.0: 480
Name: LoanOnCard, dtype: int64
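The LoanOnCard counts above already tell us what the baseline looks like. The short sketch below works it out by hand. It assumes that LoanOnCard records who took the loan in the earlier campaign, which the data description suggests but does not state outright.

```python
# Class counts copied from the LoanOnCard frequency table above
no_loan = 4500
loan = 480

total = no_loan + loan
baseline_rate = loan / total          # share of customers who have a loan on card
majority_accuracy = no_loan / total   # accuracy of always predicting "no loan"

print(f"Baseline success rate: {baseline_rate:.1%}")
print(f"Accuracy of always predicting 0: {majority_accuracy:.1%}")
```

Because the classes are this imbalanced, a model that never predicts a loan would still score about 90% accuracy, so precision and recall on class 1 are more informative than accuracy alone.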
# dropping ID and ZipCode
df = df.drop(["ID","ZipCode"],axis=1)
df.describe()
| | Unnamed: 0 | Age | CustomerSince | HighestSpend | HiddenScore | MonthlyAverageSpend | Level | Mortgage | Security | FixedDepositAccount | InternetBanking | CreditCard | LoanOnCard |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 4980.000000 | 4980.000000 | 4980.000000 | 4980.00000 | 4980.000000 | 4980.000000 | 4980.000000 | 4980.000000 | 4980.000000 | 4980.000000 | 4980.000000 | 4980.000000 | 4980.000000 |
mean | 2509.345382 | 45.352610 | 20.117671 | 73.85241 | 2.395582 | 1.939536 | 1.880924 | 56.589759 | 0.104418 | 0.060643 | 0.597189 | 0.294378 | 0.096386 |
std | 1438.011129 | 11.464212 | 11.468716 | 46.07009 | 1.147200 | 1.750006 | 0.840144 | 101.836758 | 0.305832 | 0.238697 | 0.490513 | 0.455808 | 0.295149 |
min | 9.000000 | 23.000000 | -3.000000 | 8.00000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 1264.750000 | 35.000000 | 10.000000 | 39.00000 | 1.000000 | 0.700000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 2509.500000 | 45.000000 | 20.000000 | 64.00000 | 2.000000 | 1.500000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
75% | 3754.250000 | 55.000000 | 30.000000 | 98.00000 | 3.000000 | 2.525000 | 3.000000 | 101.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 |
max | 4999.000000 | 67.000000 | 43.000000 | 224.00000 | 4.000000 | 10.000000 | 3.000000 | 635.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
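The summary still contains Unnamed: 0, which has one distinct value per row in the frequency table (Length: 4980), just like ID. It looks like a saved row index rather than a customer attribute. The sketch below shows one way to spot such identifier-like columns. The `toy` frame and its values are hypothetical, and the results printed in this case study were produced with Unnamed: 0 left in the data.

```python
import pandas as pd

# Hypothetical miniature frame; the real data comes from full_data.csv
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4, 5],
    "Age": [29, 30, 63, 65, 28, 30],
    "Level": [3, 1, 3, 2, 1, 1],
    "LoanOnCard": [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
})

# A column with one distinct value per row behaves like a row identifier
id_like = [col for col in toy.columns if toy[col].nunique() == len(toy)]
print(id_like)

# Drop the identifier-like columns before modelling
features = toy.drop(columns=id_like)
```

This is only a heuristic: a genuinely continuous feature can also be unique in every row, so each flagged column should be checked by eye before it is dropped.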
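# keep only rows with a non-negative CustomerSince (the frequency table shows values of -1, -2 and -3)
df = df.loc[df["CustomerSince"] >= 0]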
df.isnull().sum()
Unnamed: 0             0
Age                    0
CustomerSince          0
HighestSpend           0
HiddenScore            0
MonthlyAverageSpend    0
Level                  0
Mortgage               0
Security               0
FixedDepositAccount    0
InternetBanking        0
CreditCard             0
LoanOnCard             0
dtype: int64
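It is worth checking how many rows the CustomerSince filter above removes. The arithmetic below uses only the counts from the frequency table, and it also explains the support of 986 that appears in the classification reports later on, since `train_test_split` rounds a fractional test size up.

```python
import math

# Counts of negative CustomerSince values, from the frequency table
negative_counts = {-1: 32, -2: 15, -3: 4}
rows_before = 4980

rows_removed = sum(negative_counts.values())
rows_after = rows_before - rows_removed
test_rows = math.ceil(0.2 * rows_after)   # test_size=0.2, rounded up

print(rows_removed, rows_after, test_rows)
```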
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20,10))
sns.heatmap(df.corr(), annot=True)
<AxesSubplot:>
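The heatmap shows every pairwise correlation, but for this problem the column for LoanOnCard matters most. A minimal sketch of ranking features by their absolute correlation with the target follows. The data is synthetic: `spend` merely stands in for HighestSpend, and `Noise` is a made-up unrelated feature.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
spend = rng.normal(70, 40, n)    # stands in for HighestSpend (synthetic)
noise = rng.normal(0, 1, n)      # made-up feature unrelated to the target
target = (spend + rng.normal(0, 20, n) > 110).astype(int)

toy = pd.DataFrame({"HighestSpend": spend, "Noise": noise, "LoanOnCard": target})

# Correlation of every feature with the target, strongest first
ranking = (
    toy.corr()["LoanOnCard"]
    .drop("LoanOnCard")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

On the real data the same three chained calls can be applied to `df.corr()`.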
import seaborn as sns
import matplotlib.pyplot as plt
#Number of data by Level
plt.figure(figsize=[10,5])
sns.countplot(x = 'Level',hue = 'LoanOnCard', data = df)
<AxesSubplot:xlabel='Level', ylabel='count'>
#Number of data by HiddenScore
plt.figure(figsize=[10,5])
sns.countplot(x = 'HiddenScore',hue = 'LoanOnCard', data = df)
<AxesSubplot:xlabel='HiddenScore', ylabel='count'>
plt.figure(figsize=(20,5))
sns.countplot(x = 'Age', hue = 'LoanOnCard', data=df)
<AxesSubplot:xlabel='Age', ylabel='count'>
#Pair plot
sns.pairplot(df, hue='LoanOnCard',
             vars=['Age', 'CustomerSince', 'HighestSpend',
                   'MonthlyAverageSpend', 'Mortgage'])
<seaborn.axisgrid.PairGrid at 0x2132ea58>
plt.figure(figsize=(20,5))
sns.countplot(x = 'MonthlyAverageSpend', hue = 'LoanOnCard', data=df)
<AxesSubplot:xlabel='MonthlyAverageSpend', ylabel='count'>
plt.hist(df['MonthlyAverageSpend'])
(array([1670., 1346., 1017., 319., 218., 97., 131., 80., 45., 6.]), array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.]), <BarContainer object of 10 artists>)
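MonthlyAverageSpend has 108 distinct values, so a count plot with one bar per value is hard to read. One alternative is to group the values into bands and look at the loan rate per band. The sketch below uses a hypothetical eight-row frame and arbitrary band edges, purely to illustrate the idea.

```python
import pandas as pd

# Hypothetical values, chosen only for illustration
toy = pd.DataFrame({
    "MonthlyAverageSpend": [0.3, 0.8, 1.5, 2.0, 3.2, 4.5, 6.0, 8.9],
    "LoanOnCard": [0, 0, 0, 0, 1, 0, 1, 1],
})

# Arbitrary band edges; include_lowest keeps a spend of exactly 0 in the first band
bins = [0, 2, 4, 10]
toy["SpendBand"] = pd.cut(toy["MonthlyAverageSpend"], bins=bins, include_lowest=True)

# Share of customers with a loan in each band
rate_by_band = toy.groupby("SpendBand", observed=True)["LoanOnCard"].mean()
print(rate_by_band)
```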
#Let's separate the target column from the features before we do the train-test split
X = df.drop('LoanOnCard',axis=1)
y = df['LoanOnCard']
#Now we will split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
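With only about one customer in ten in the positive class, a purely random split can leave the train and test sets with slightly different class ratios. Passing `stratify=y` keeps the ratio the same in both. The sketch below uses made-up labels to show the effect. It is an optional variation, not the split used for the results in this case study.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels with roughly the same imbalance as LoanOnCard
y = np.array([1] * 100 + [0] * 900)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=10, stratify=y
)

# Both shares of positives match the overall 10%
print(y_tr.mean(), y_te.mean())
```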
# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
C:\Users\ASUS\anaconda3\envs\py36\lib\site-packages\sklearn\linear_model\_logistic.py:765: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
LogisticRegression(random_state=0)
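The ConvergenceWarning above suggests two remedies: scale the data or raise `max_iter`. A minimal sketch that does both with a scikit-learn pipeline is shown below. It runs on synthetic data from `make_classification`, with the feature scales stretched on purpose, so the numbers it prints say nothing about the bank data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bank data, with features on very different scales
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X = X * [1, 10, 100, 1000, 1, 10, 100, 1000]

# Standardise the features, then fit logistic regression with a higher iteration cap
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=0),
)
model.fit(X, y)

n_iter = model.named_steps["logisticregression"].n_iter_[0]
print(f"lbfgs iterations used: {n_iter}")
```

Putting the scaler inside the pipeline means it is fitted on the training data only, which avoids leaking test-set statistics into the model.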
#Predict
y_predict_test = classifier.predict(X_test)
#Check accuracy
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict_test))
| | precision | recall | f1-score | support |
---|---|---|---|---|
0.0 | 0.95 | 0.98 | 0.97 | 880 |
1.0 | 0.81 | 0.57 | 0.67 | 106 |
accuracy | | | 0.94 | 986 |
macro avg | 0.88 | 0.78 | 0.82 | 986 |
weighted avg | 0.93 | 0.94 | 0.93 | 986 |
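For the campaign, the row for class 1.0 is the one to read. Precision is the share of targeted customers who actually take the loan, and recall is the share of all borrowers that the model finds. The arithmetic below turns the figures from the logistic regression report into a rough lift over contacting everyone in the test set. The inputs are rounded to two decimals in the report, so the results are approximate.

```python
# Figures copied from the logistic regression report above (class 1.0)
precision = 0.81
recall = 0.57
positives_in_test = 106
test_size = 986

base_rate = positives_in_test / test_size              # hit rate when contacting everyone
lift = precision / base_rate                           # gain from contacting only predicted 1s
expected_true_positives = recall * positives_in_test   # approximate borrowers found

print(f"Base rate: {base_rate:.1%}, lift: {lift:.1f}x")
print(f"Borrowers found: about {expected_true_positives:.0f} of {positives_in_test}")
```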
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X_train, y_train)
DecisionTreeClassifier(random_state=0)
#Predict on the test set
y_pred = classifier.predict(X_test)
#Check accuracy
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
| | precision | recall | f1-score | support |
---|---|---|---|---|
0.0 | 0.99 | 1.00 | 0.99 | 880 |
1.0 | 0.99 | 0.92 | 0.95 | 106 |
accuracy | | | 0.99 | 986 |
macro avg | 0.99 | 0.96 | 0.97 | 986 |
weighted avg | 0.99 | 0.99 | 0.99 | 986 |
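A score from a single train-test split can be optimistic, especially for an unpruned decision tree. Before picking a winner, it helps to compare the models with cross-validation. The sketch below does this on synthetic, imbalanced data. The data set, the F1 scoring and the five folds are all assumptions made for illustration, so the scores it prints are not results for the bank data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data (about 10% positives)
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.9], random_state=0
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000, random_state=0),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # F1 on the positive class, one score per stratified fold
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores[name].mean():.2f}")
```

On the real data, `X` and `y` from the split section can be passed in directly, and a Random Forest can be added to the `models` dictionary in the same way.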