A Support Vector Machine (SVM) is a powerful and versatile Machine Learning model, capable of performing linear or nonlinear classification, regression, and even outlier detection. It is one of the most popular models in Machine Learning, and anyone interested in the field should have it in their toolbox. SVMs are particularly well suited to classifying complex small- and medium-sized datasets. In this article, we will work through a classification problem with the Support Vector Machine algorithm and explain the results in plain English.
Suppose you work as a data scientist at a major bank in NYC, and you have been tasked with developing a model that predicts whether a customer will be able to retire, based on two features: age and net savings (retirement savings in the U.S.). Here, Retire is the dependent variable, while Age and Savings are the independent variables. A machine learning algorithm such as the Support Vector Machine (SVM) is a natural fit for this problem.
You could also apply other classification algorithms and compare their accuracy against the SVM; a sketch of such a comparison appears at the end of this article.
# Import libraries
import pandas as pd                # pandas for data manipulation with DataFrames
import numpy as np                 # NumPy for numerical computation
import matplotlib.pyplot as plt    # matplotlib for data visualisation
import seaborn as sns              # seaborn for statistical data visualisation
import os
import warnings
warnings.filterwarnings("ignore")  # silence non-critical warnings for cleaner output
os.chdir("D:\\Python\\5")          # directory that contains the dataset
bank_df = pd.read_csv('Bank_Customer_retirement.csv')
bank_df.head()
|   | Customer ID | Age       | Savings     | Retire |
|---|-------------|-----------|-------------|--------|
| 0 | 0           | 39.180417 | 322349.8740 | 0      |
| 1 | 1           | 56.101686 | 768671.5740 | 1      |
| 2 | 2           | 57.023043 | 821505.4718 | 1      |
| 3 | 3           | 43.711358 | 494187.4850 | 0      |
| 4 | 4           | 54.728823 | 691435.7723 | 1      |
# Drop Customer ID: an arbitrary identifier with no predictive value
bank_df.drop("Customer ID",axis=1,inplace=True)
bank_df.head(2)
|   | Age       | Savings    | Retire |
|---|-----------|------------|--------|
| 0 | 39.180417 | 322349.874 | 0      |
| 1 | 56.101686 | 768671.574 | 1      |
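Notice the scale gap between the features: Age is in the tens while Savings is in the hundreds of thousands. This will matter later when we look at feature scaling. A quick way to confirm the ranges, using standard pandas on the bank_df loaded above:

# Summary statistics: count, mean, std, min/max and quartiles for each column
bank_df.describe()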
sns.pairplot(bank_df, hue = 'Retire', vars = ['Age', 'Savings'] )
(Output: seaborn pair plot of Age and Savings, colored by Retire.)
sns.countplot(x='Retire', data=bank_df)  # bar chart of class counts
(Output: count plot showing the number of customers in each Retire class.)
# number of missing values by variables
bank_df.isnull().sum()
Age        0
Savings    0
Retire     0
dtype: int64
# Drop the target label column to keep only the independent variables
X = bank_df.drop(['Retire'], axis=1)
# Save the target label column as y
y = bank_df['Retire']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=1005)
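Before training, it is worth a quick sanity check that both classes are well represented in each split. A minimal sketch, assuming the y_train and y_test produced above:

# Class proportions in each split (0 = not able to retire, 1 = able to retire)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))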
from sklearn.svm import SVC
svc_model = SVC()  # default SVC uses the RBF kernel
svc_model.fit(X_train, y_train)
from sklearn.metrics import classification_report, confusion_matrix
y_predict = svc_model.predict(X_test)
cm = confusion_matrix(y_test, y_predict)
sns.heatmap(cm, annot=True, fmt='d')  # fmt='d' renders the counts as integers
(Output: heatmap of the confusion matrix.)
print(classification_report(y_test, y_predict))
              precision    recall  f1-score   support

           0       0.94      0.85      0.90        55
           1       0.84      0.93      0.88        45

    accuracy                           0.89       100
   macro avg       0.89      0.89      0.89       100
weighted avg       0.90      0.89      0.89       100
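The diagonal of the confusion matrix holds the correct predictions, so the report's overall accuracy can be recovered straight from it. A minimal sketch, using the cm computed above:

# Accuracy = correct predictions / all predictions = trace of cm / sum of cm
accuracy = cm.trace() / cm.sum()
print(f"Accuracy: {accuracy:.2f}")  # matches the 0.89 in the report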
The SVC's default RBF kernel is based on distances between samples, so a feature on a much larger scale (Savings is in the hundreds of thousands, Age in the tens) dominates the kernel computation. Standardising both features to zero mean and unit variance usually improves the fit.
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
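StandardScaler rescales each feature to zero mean and unit variance, with the statistics learned on the training set only (hence fit_transform on train, transform on test). A quick check on the scaled arrays from above:

# Each column of the scaled training set should have mean ~0 and std ~1
print(X_train.mean(axis=0))  # approximately [0, 0]
print(X_train.std(axis=0))   # approximately [1, 1]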
from sklearn.svm import SVC
svc_model = SVC()  # same default SVC, retrained on the scaled features
svc_model.fit(X_train, y_train)
from sklearn.metrics import classification_report, confusion_matrix
y_predict = svc_model.predict(X_test)
cm = confusion_matrix(y_test, y_predict)
print(classification_report(y_test, y_predict))
              precision    recall  f1-score   support

           0       0.95      0.95      0.95        55
           1       0.93      0.93      0.93        45

    accuracy                           0.94       100
   macro avg       0.94      0.94      0.94       100
weighted avg       0.94      0.94      0.94       100
Scaling alone lifted accuracy from 0.89 to 0.94. We can try to squeeze out a little more by tuning the SVM's hyperparameters with a grid search over the regularisation strength C, the kernel coefficient gamma, and the kernel itself.
param_grid = {'C': [0.001, 0.1, 1], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(SVC(), param_grid, refit=True)  # refit=True retrains the best combination on the full training set
grid.fit(X_train,y_train)
grid_predictions = grid.predict(X_test)
cm = confusion_matrix(y_test, grid_predictions)
print(classification_report(y_test,grid_predictions))
              precision    recall  f1-score   support

           0       0.96      0.95      0.95        55
           1       0.93      0.96      0.95        45

    accuracy                           0.95       100
   macro avg       0.95      0.95      0.95       100
weighted avg       0.95      0.95      0.95       100
grid.best_params_
{'C': 1, 'gamma': 1, 'kernel': 'linear'}
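The tuned model uses a linear kernel with C = 1 (gamma is ignored by the linear kernel), and accuracy edges up to 0.95. Note that C = 1 sits at the upper edge of the search grid, so widening the C range could be worth a try. Finally, as promised earlier, here is a minimal sketch of comparing the SVM against another classifier on the same scaled training data; logistic regression is used purely as an illustrative baseline, not as part of the original analysis.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of the tuned SVM vs. a logistic regression baseline
svm_scores = cross_val_score(grid.best_estimator_, X_train, y_train, cv=5)
logreg_scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)
print(f"SVM: {svm_scores.mean():.3f}  Logistic Regression: {logreg_scores.mean():.3f}")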