
BREAST CANCER CLASSIFICATION USING SUPPORT VECTOR MACHINES

Support Vector Machine

SVM is one of the most popular algorithms for high-dimensional data. Its goal is to find a decision boundary that separates the data of different classes. We will discuss in detail how that works, implement the algorithm with scikit-learn, and apply it to real-life problems, including our main project of cancer prediction. We will also show how different tactics can further improve the accuracy.
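As a minimal sketch of the idea (toy 2-D data for illustration only, not the cancer dataset used below), a linear SVC fit on two separable point clouds learns a maximum-margin boundary defined by a handful of support vectors:

# Illustrative only: fit a linear SVM on two small, linearly separable point clouds
from sklearn.svm import SVC
import numpy as np

X_toy = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel='linear').fit(X_toy, y_toy)
print(clf.support_vectors_)           # the points that define the margin
print(clf.predict([[3, 2], [7, 6]]))  # expected: [0 1] (one point near each cloud)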

PROBLEM STATEMENT

Here we have a dataset of different patients, showing the characteristics of cells that were suspected to be cancerous. After thorough diagnosis, each case was determined to be either Malignant (fatal) or Benign (not so harmful). We will use a machine learning algorithm, the Support Vector Machine classification technique, to classify the cells as Malignant or Benign using Python. Then we will compare our predictions with the original labels to check the model's accuracy.

  • Predicting if the cancer diagnosis is benign or malignant based on several observations/features
  • 30 features are used, examples:

    - radius (mean of distances from center to points on the perimeter)
    - texture (standard deviation of gray-scale values)
    - perimeter
    - area
    - smoothness (local variation in radius lengths)
    - compactness (perimeter^2 / area - 1.0)
    - concavity (severity of concave portions of the contour)
    - concave points (number of concave portions of the contour)
    - symmetry 
    - fractal dimension ("coastline approximation" - 1)
  • Datasets are linearly separable using all 30 input features

  • Number of Instances: 569
  • Class Distribution: 212 Malignant, 357 Benign
  • Target class:
    - Malignant
    - Benign

Importing Libraries

In [45]:
# import libraries 
import pandas as pd # Import Pandas for data manipulation using dataframes
import numpy as np # Import Numpy for data statistical analysis 
import matplotlib.pyplot as plt # Import matplotlib for data visualisation
import seaborn as sns # Statistical data visualization
# %matplotlib inline
import warnings
warnings.filterwarnings("ignore")

Load the data - built-in dataset from the Scikit-Learn package

In [46]:
# Import the cancer data from the sklearn library
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
# Read the DataFrame, first using the feature data
df = pd.DataFrame(data.data, columns=data.feature_names)
# Add a target column, and fill it with the target data
df['target'] = data.target
# Show the first five rows
df.head()
Out[46]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 0
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 0
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 0
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 0
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 0

5 rows × 31 columns

If the target column is 0 it means the tumour is malignant (fatal), and if it is 1 it means it is benign (not so harmful)
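This mapping can be confirmed directly from the dataset object loaded above (a quick optional check):

# target_names[i] gives the class name encoded as integer i
print(data.target_names)              # ['malignant' 'benign']  -> 0 = malignant, 1 = benign
print(df['target'].value_counts())    # 357 benign (1) vs 212 malignant (0)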

The rest of the columns show the different characteristics of the cells found in the patient's body

VISUALIZING THE DATA

Here we show histograms and pairwise scatter plots for a few of the mean features, coloured by the value of the target.

Observation: There is a clear demarcation between the target values (0 and 1) with respect to the other variables. We can conclude that the independent variables (all variables except target) are likely to predict the dependent variable (target) to a great extent.

In [47]:
sns.pairplot(df, hue = 'target', vars = ['mean radius', 'mean texture', 'mean area', 'mean perimeter', 'mean smoothness'] )
Out[47]:
<seaborn.axisgrid.PairGrid at 0x2186c4a8>

Now we check the frequency distribution of the target column.

Observation: The counts of 1s and 0s are of a similar order (357 vs 212), so we do not have a serious class-imbalance problem.

In [48]:
sns.countplot(x='target', data=df, label="Target")
Out[48]:
<AxesSubplot:xlabel='target', ylabel='count'>
In [49]:
sns.scatterplot(x = 'mean area', y = 'mean smoothness', hue = 'target', data = df)
Out[49]:
<AxesSubplot:xlabel='mean area', ylabel='mean smoothness'>
In [50]:
sns.lmplot(x='mean area', y='mean smoothness', hue='target', data=df, fit_reg=False)
Out[50]:
<seaborn.axisgrid.FacetGrid at 0x1f2c2320>
In [51]:
# Let's check the correlation between the variables 
# Strong correlation between mean radius and mean perimeter, and between mean area and mean perimeter
plt.figure(figsize=(20,10)) 
sns.heatmap(df.corr(), annot=True) 
Out[51]:
<AxesSubplot:>
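The comment above singles out a few strongly correlated pairs; they can also be listed programmatically (a small sketch, not part of the original notebook):

# Rank feature pairs by absolute correlation, keeping only the upper triangle of the matrix
corr = df.drop(columns='target').corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head(5))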

Checking for missing values

In [52]:
# number of missing values by variables
df.isnull().sum()
Out[52]:
mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
target                     0
dtype: int64

Observation: There are no missing values in the data

Data split

Segregating the independent variables as X and the dependent variable as y

In [53]:
# Let's drop the target label column
X = df.drop(['target'],axis=1)
In [54]:
y = df['target']

Now we will split the data into a training set (80% of the data) and a test set (the remaining 20%), which will be kept aside for later evaluation.

In [55]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=5)
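A quick sanity check of the split sizes (an optional addition); passing stratify=y to train_test_split would additionally preserve the 357/212 class ratio in both sets:

# 80/20 split of 569 rows -> 455 training rows and 114 test rows, 30 features each
print(X_train.shape, X_test.shape)    # (455, 30) (114, 30)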

MODEL TRAINING using Scikit Learn

We will use the Support Vector Machine algorithm from the Scikit-Learn package, with all of the package's default parameters. Later, we will tune the hyperparameters to further increase the accuracy.
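For reference, in recent scikit-learn versions SVC() defaults to an RBF kernel with C=1.0 and gamma='scale', so the call below is equivalent with the defaults spelled out (exact defaults can vary between versions):

# Spelled-out equivalent of SVC() with its default hyperparameters (recent scikit-learn)
svc_model = SVC(kernel='rbf', C=1.0, gamma='scale')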

In [56]:
from sklearn.svm import SVC 
from sklearn.metrics import classification_report, confusion_matrix

svc_model = SVC()
svc_model.fit(X_train, y_train)
Out[56]:
SVC()

EVALUATING THE MODEL

Once the model is trained, we use it to predict the labels of the test data.

In [57]:
y_predict = svc_model.predict(X_test)
cm = confusion_matrix(y_test, y_predict)
sns.heatmap(cm, annot=True)
Out[57]:
<AxesSubplot:>
In [58]:
print(classification_report(y_test, y_predict))
              precision    recall  f1-score   support

           0       1.00      0.85      0.92        48
           1       0.90      1.00      0.95        66

    accuracy                           0.94       114
   macro avg       0.95      0.93      0.94       114
weighted avg       0.94      0.94      0.94       114

Overall accuracy is 94%, and the precision for classes 0 and 1 is 100% and 90% respectively.

Improving the model using feature scaling

In [59]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
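Why scaling helps here: the RBF kernel is distance-based, so features measured on large scales (e.g. mean area, in the thousands) would otherwise dominate features on small scales (e.g. smoothness, around 0.1). A quick check that the transform behaved as expected (a sketch, not in the original notebook):

# After StandardScaler, each training feature has (approximately) zero mean and unit variance
print(np.round(X_train.mean(axis=0)[:3], 3))   # ~ [0. 0. 0.]
print(np.round(X_train.std(axis=0)[:3], 3))    # ~ [1. 1. 1.]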

Model training

In [60]:
from sklearn.svm import SVC
svc_model = SVC()
svc_model.fit(X_train, y_train)
Out[60]:
SVC()

Evaluating the model with scaled data

In [61]:
y_predict = svc_model.predict(X_test)
cm = confusion_matrix(y_test, y_predict)
sns.heatmap(cm, annot=True)
Out[61]:
<AxesSubplot:>
In [62]:
from sklearn.metrics import classification_report, confusion_matrix
y_predict = svc_model.predict(X_test)
cm = confusion_matrix(y_test, y_predict)
print(classification_report(y_test, y_predict))
              precision    recall  f1-score   support

           0       0.98      0.94      0.96        48
           1       0.96      0.98      0.97        66

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114

Observation: The accuracy has improved to 96% from the previous model's 94%

IMPROVING THE MODEL - Hyperparameter tuning

We will now try different values of the C parameter and gamma, along with different kernels, to fine-tune the model and achieve a higher accuracy. Roughly speaking, C controls how strongly misclassified training points are penalised (a smaller C means stronger regularisation and a wider margin), gamma controls how far the influence of a single training example reaches with the RBF kernel, and the kernel determines the shape of the decision boundary.

In [63]:
param_grid = {'C': [0.001, 0.1, 1], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['linear', 'poly', 'rbf', 'sigmoid']} 
In [64]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(SVC(),param_grid,refit=True)
grid.fit(X_train,y_train)
Out[64]:
GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.001, 0.1, 1], 'gamma': [1, 0.1, 0.01, 0.001],
                         'kernel': ['linear', 'poly', 'rbf', 'sigmoid']})
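For scale, this grid contains 3 × 4 × 4 = 48 parameter combinations, each evaluated with GridSearchCV's default 5-fold cross-validation in recent scikit-learn versions. The mean cross-validated accuracy of the winning combination can also be inspected (a small addition, not in the original notebook):

# Mean cross-validated accuracy of the best parameter combination found by the grid search
print(round(grid.best_score_, 4))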
In [65]:
grid.best_params_
Out[65]:
{'C': 0.1, 'gamma': 1, 'kernel': 'linear'}
In [66]:
grid.best_estimator_
Out[66]:
SVC(C=0.1, gamma=1, kernel='linear')
In [67]:
grid_predictions = grid.predict(X_test)
cm = confusion_matrix(y_test, grid_predictions)
sns.heatmap(cm, annot=True)
Out[67]:
<AxesSubplot:>
In [68]:
print(classification_report(y_test,grid_predictions))
              precision    recall  f1-score   support

           0       0.98      0.96      0.97        48
           1       0.97      0.98      0.98        66

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

Observation: The accuracy has further improved to 97% from the previous model's 96%. This will be our final result.

This is how we can check for the best combination of hyperparameters to get the best result.

In [69]:
grid.best_params_
Out[69]:
{'C': 0.1, 'gamma': 1, 'kernel': 'linear'}


Analytics Educator, based in Kolkata, is the best institute for Data Science courses. We specialize in training students from non-technical backgrounds with zero programming or statistical knowledge, and we help them learn data science and get a job in this field. You may check out all our instructor-led courses from this link: https://analyticseducator.com/Courses-Offers.html
