
Building a Logistic Regression in Python, Step by Step, Using Titanic Data

Logistic Regression

There is a separate regression model that handles the situation where the output variable is a binary or categorical variable rather than a continuous one. This model is called logistic regression. In other words, logistic regression is a variation of linear regression for a binary or categorical output variable. The two techniques are similar in that both model the output through a linear combination of the predictor variables. However, as we will see shortly, in logistic regression that linear combination must undergo a transformation before it can be read as a probability.
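As an illustration of that transformation, the linear combination of the predictors is passed through the sigmoid (logistic) function, which squashes any real number into the range 0 to 1 so it can be read as a probability. A minimal sketch in Python (the coefficient values below are made up purely for illustration):

import numpy as np

def sigmoid(z):
    # maps any real number into the (0, 1) interval
    return 1 / (1 + np.exp(-z))

# hypothetical model: intercept + coefficient * income
income = 55000                  # made-up customer income
z = -4.0 + 0.00008 * income     # linear combination = 0.4
print(sigmoid(z))               # ~0.60, read as P(customer buys the product)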

A few scenarios where logistic regression can be applied are as follows:

  • To predict whether a random customer will buy a particular product or not, given details such as income, gender, shopping history, and advertisement history
  • To predict whether a team will win or lose a match, given the match and team details such as weather, form of players, stadium, and hours spent in training

Note how the output variable in both cases is a binary or categorical variable.

PROBLEM STATEMENT

The sinking of the Titanic on April 15th, 1912 is one of the most infamous tragedies in history. The Titanic sank during her maiden voyage after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. The number of survivors was low partly because there were not enough lifeboats for everyone aboard. While there was some element of luck involved in surviving, some groups of people, such as women, children, and the upper class, seem to have been more likely to survive than others. This case study analyzes what sorts of people were likely to survive this tragedy. The dataset includes the following:

  • Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • Sex: Sex of the passenger
  • Age: Age in years
  • Sibsp: # of siblings / spouses aboard the Titanic
  • Parch: # of parents / children aboard the Titanic
  • Ticket: Ticket number
  • Fare: Passenger fare
  • Cabin: Cabin number
  • Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
  • Survived (target variable): Survival (0 = No, 1 = Yes)

DATA SOURCE: https://www.kaggle.com/c/titanic

Methodology

We will use a machine learning algorithm, binary logistic regression, in Python to predict whether a person survived or not (denoted by 1 and 0). Logistic regression is a classification algorithm suited to a binary dependent variable; its output is the probability of the outcome being 1.

We have two datasets, train and test. We will build the model on the train data and then test the model on the test data to check the accuracy.

First we will do exploratory data analysis, followed by feature engineering and outlier and missing value treatment.

Then we will run logistic regression using both statsmodels and scikit-learn (sklearn) and compare their accuracy, to judge which provides the better output.

IMPORT LIBRARIES

We are importing all the required libraries

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
                    

IMPORT DATASET

In [2]:
# read the data using pandas dataframe
import os
os.chdir("D:\\Python case study 4")
train = pd.read_csv('Train_Titanic.csv')
                    
In [3]:
# Show the data head!
train.head()
                    
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

The data has been properly imported.

Observation: We can't predict anything from the PassengerId, Name, and Ticket columns, hence we will drop them. Here, Survived is our dependent variable. The port where a person boarded the ship (Embarked column) shouldn't predict their chance of survival, hence it will also be dropped.

In [5]:
train.drop(["PassengerId","Name","Ticket","Embarked"],axis=1,inplace=True)
train.head(2)
                    
Out[5]:
Survived Pclass Sex Age SibSp Parch Fare Cabin
0 0 3 male 22.0 1 0 7.2500 NaN
1 1 1 female 38.0 1 0 71.2833 C85

EXPLORE/VISUALIZE DATASET

In [6]:
# Let's count the number of survivors and non-survivors
train['Survived'].value_counts()
                    
Out[6]:
0    549
1    342
Name: Survived, dtype: int64

Observation: Here 1 means survived and 0 means died. The data is reasonably balanced, since the numbers of 1s and 0s are fairly close.
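The same counts can be viewed as proportions (a small optional check, not part of the original notebook); roughly 62% did not survive and 38% survived:

# share of each outcome in the training data
train['Survived'].value_counts(normalize=True)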

In [7]:
#Number of people travelling by different class
plt.figure(figsize=[10,5])
sns.countplot(x = 'Pclass', data = train)
                    
Out[7]:
<AxesSubplot:xlabel='Pclass', ylabel='count'>

Observation: Most people were travelling in 3rd class.

In [8]:
plt.figure(figsize=[10,5])
sns.countplot(x = 'Pclass', hue = 'Survived', data=train)
                    
Out[8]:
<AxesSubplot:xlabel='Pclass', ylabel='count'>

Observation: More people travelling in 1st class survived; it is almost even for 2nd class (marginally more died than survived), and most of the people travelling in 3rd class died. This chart shows that class had some impact on whether a person would survive or not.

Now the same analysis will be made for the other independent variables.

In [9]:
plt.figure(figsize=[10,5])
sns.countplot(x = 'SibSp', hue = 'Survived', data=train)
                    
Out[9]:
<AxesSubplot:xlabel='SibSp', ylabel='count'>

Observation: Bar chart showing the number of people who survived, by number of siblings/spouses aboard (SibSp). If you have 1 sibling or spouse aboard (SibSp = 1), you have a higher chance of survival than if you are alone (SibSp = 0).

In [10]:
plt.figure(figsize=[10,5])
sns.countplot(x = 'Parch', hue = 'Survived', data=train)
                    
Out[10]:
<AxesSubplot:xlabel='Parch', ylabel='count'>

Observation: Bar chart showing the number of people who survived, by number of parents/children aboard (Parch). If you have 1, 2, or 3 such family members aboard (Parch = 1, 2, or 3), you have a higher chance of survival than if you are alone (Parch = 0).

In [11]:
plt.figure(figsize=[10,5])
sns.countplot(x = 'Sex', hue = 'Survived', data=train)
                    
Out[11]:
<AxesSubplot:xlabel='Sex', ylabel='count'>

Observation: Bar chart showing the number of people who survived, by sex. If you are a female, you have a much higher chance of survival than a male.

Women and children were given priority during the evacuation, hence it makes sense that sex will help to predict the chances of survival.
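This can also be checked numerically; an optional short sketch computing the survival rate within each sex (the mean of a 0/1 column is the proportion of 1s):

# survival rate per sex; females survive at a much higher rate than males
train.groupby('Sex')['Survived'].mean()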

In [12]:
plt.figure(figsize=(20,5))
sns.countplot(x = 'Age', hue = 'Survived', data=train)
                    
Out[12]:
<AxesSubplot:xlabel='Age', ylabel='count'>

Observation: Bar chart showing the number of people who survived, by age. If you are a baby, you have a higher chance of survival.

Women and children were given priority during the evacuation, hence it makes sense that age will help to predict the chances of survival.

In [13]:
# Age Histogram 
train['Age'].hist(bins = 40)
                    
Out[13]:
<AxesSubplot:>

Observation: The histogram shows that most passengers were between roughly 20 and 30 years of age.

In [14]:
plt.figure(figsize=(80,40))
sns.countplot(x = 'Fare', hue = 'Survived', data=train)
                    
Out[14]:
<AxesSubplot:xlabel='Fare', ylabel='count'>

Observation: Bar chart showing the number of people who survived, by fare. If you pay a higher fare, you have a higher chance of survival.

In [15]:
# Fare Histogram 
train['Fare'].hist(bins = 40)
                    
Out[15]:
<AxesSubplot:>

Observation: Most people paid a low fare; only a handful paid a high fare.

PREPARE THE DATA FOR TRAINING / DATA CLEANING

In [16]:
# number of missing values by variables
train.isnull().sum()
                    
Out[16]:
Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
dtype: int64

Observation: There are missing values for Age and Cabin

In [17]:
# percentage of missing values by variables
train.isnull().mean()*100
                    
Out[17]:
Survived     0.000000
Pclass       0.000000
Sex          0.000000
Age         19.865320
SibSp        0.000000
Parch        0.000000
Fare         0.000000
Cabin       77.104377
dtype: float64

Observation: The missing values are shown as percentages: 77% of the values in Cabin are missing, and 19.8% of the values in Age are missing.

In [18]:
# Let's visualize which variables in the dataset are missing
sns.heatmap(train.isnull(), yticklabels = False, cbar = False, cmap="Blues")
                    
Out[18]:
<AxesSubplot:>

Observation: Since a very high percentage of the values in Cabin are missing, this variable is not going to help the model. We will drop it from the dataset.

In [19]:
# Dropping the Cabin column
train.drop('Cabin',axis=1,inplace=True)
                    
In [20]:
train.head()
                    
Out[20]:
Survived Pclass Sex Age SibSp Parch Fare
0 0 3 male 22.0 1 0 7.2500
1 1 1 female 38.0 1 0 71.2833
2 1 3 female 26.0 0 0 7.9250
3 1 1 female 35.0 1 0 53.1000
4 0 3 male 35.0 0 0 8.0500
In [21]:
# Let's view the missing values in the data one more time!
sns.heatmap(train.isnull(), yticklabels = False, cbar = False, cmap="Blues")
                    
Out[21]:
<AxesSubplot:>

Observation: About 19.8% of the Age values are missing. We can't drop the column entirely, nor can we leave the values missing, so we will impute them. However, simply replacing every missing value with the overall average age would be misleading.

The overall mean of Age is:

In [48]:
#Mean of total Age
train.Age.mean()
                    
Out[48]:
29.736034227171306

Observation: If we check the average age separately for males and females, we can see that the values differ.

In [23]:
# Boxplot of Age by Sex: the typical age of males and females differs
plt.figure(figsize=(15, 10))
sns.boxplot(x='Sex', y='Age',data=train)
                    
Out[23]:
<AxesSubplot:xlabel='Sex', ylabel='Age'>

Hence, we should replace the missing Age values for females with the average female age and the missing Age values for males with the average male age.

In [24]:
#Shows the missing values of Age for male
train.loc[(train["Age"].isnull()) & (train["Sex"] == "male"),"Age"].head()
                    
Out[24]:
5    NaN
17   NaN
26   NaN
29   NaN
36   NaN
Name: Age, dtype: float64
In [25]:
#Shows the average age for male
train.loc[train["Sex"] == "male","Age"].mean()
                    
Out[25]:
30.72664459161148
In [26]:
#Replace missing age for male and female with average age of male and female respectively
train.loc[(train["Age"].isnull()) & (train["Sex"] == "male"),"Age"] = train.loc[train["Sex"] == "male","Age"].mean()
train.loc[(train["Age"].isnull()) & (train["Sex"] == "female"),"Age"] = train.loc[train["Sex"] == "female","Age"].mean()
                    
In [27]:
#Check again for missing values 
sns.heatmap(train.isnull(), yticklabels = False, cbar = False, cmap="Blues")
# now there are no missing values
                    
Out[27]:
<AxesSubplot:>

Now there are no missing values in the dataset. We have completed the data cleaning.
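As a side note, the same sex-wise imputation can be written more compactly with a single groupby transform; a minimal sketch equivalent in effect to the two .loc assignments above:

# fill each missing Age with the mean Age of that passenger's sex
train['Age'] = train['Age'].fillna(train.groupby('Sex')['Age'].transform('mean'))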

Create Dummy variables

We need to create dummy variables for all the categorical variables.

In [28]:
train.head()
                    
Out[28]:
Survived Pclass Sex Age SibSp Parch Fare
0 0 3 male 22.0 1 0 7.2500
1 1 1 female 38.0 1 0 71.2833
2 1 3 female 26.0 0 0 7.9250
3 1 1 female 35.0 1 0 53.1000
4 0 3 male 35.0 0 0 8.0500

Observation: We can see that we have only one categorical variable here, i.e. Sex. We will create the dummy variable as shown below:

In [29]:
train = pd.get_dummies(data=train, columns=['Sex'],drop_first=True)
                    
In [30]:
train.head()
                    
Out[30]:
Survived Pclass Age SibSp Parch Fare Sex_male
0 0 3 22.0 1 0 7.2500 1
1 1 1 38.0 1 0 71.2833 0
2 1 3 26.0 0 0 7.9250 0
3 1 1 35.0 1 0 53.1000 0
4 0 3 35.0 0 0 8.0500 1

Data split

We will now split the data into the dependent variable (y) and the independent variables (X).

In [31]:
# Let's drop the target column before we do the train/test split
X = train.drop('Survived',axis=1)
y = train['Survived']
                    

Now we will split the data into a training set (80% of the data) and a test set (the remaining 20%), which will be kept aside for later use.

In [32]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
                    

MODEL TRAINING using Scikit-Learn

First we will fit the multiple logistic regression model using the scikit-learn package, one of the most popular Python packages for this task. In this workflow we do not manually check p-values, confidence intervals, or multicollinearity; the model is simply fitted on all the predictors.

In [33]:
# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
                    
Out[33]:
LogisticRegression(random_state=0)

MODEL TESTING

Once the model is trained, we will generate predictions for the test data.

In [34]:
y_predict_test = classifier.predict(X_test)
                    

Now we will check the confusion matrix and accuracy of the model

In [35]:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_predict_test,y_test))
                    
[[100  17]
 [ 17  45]]
                    
In [36]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict_test))
                    
              precision    recall  f1-score   support

           0       0.85      0.85      0.85       117
           1       0.73      0.73      0.73        62

    accuracy                           0.81       179
   macro avg       0.79      0.79      0.79       179
weighted avg       0.81      0.81      0.81       179
                    
                    

Overall accuracy is 81%, and the precision for 0 and 1 is 85% and 73% respectively.
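These figures follow directly from the confusion matrix above; a quick check of the arithmetic:

# correct predictions (diagonal) divided by all predictions
(100 + 45) / (100 + 17 + 17 + 45)   # = 145 / 179 ≈ 0.81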

MODEL TRAINING using statsmodels

statsmodels is a very useful package for algorithms like logistic regression. It lets us examine the relationship of each independent variable with the dependent variable.

It also provides various model validation statistics to check and fine-tune the model.

In [37]:
#Import package and run the model
import statsmodels.api as sm
logit = sm.Logit(y_train, X_train)
# fit the model
result = logit.fit()
result.summary()  # check which p-values are significant
                    
Optimization terminated successfully.
         Current function value: 0.505715
         Iterations 6
                    
Out[37]:
Logit Regression Results
Dep. Variable: Survived No. Observations: 712
Model: Logit Df Residuals: 706
Method: MLE Df Model: 5
Date: Mon, 28 Mar 2022 Pseudo R-squ.: 0.2454
Time: 12:55:37 Log-Likelihood: -360.07
converged: True LL-Null: -477.17
Covariance Type: nonrobust LLR p-value: 1.343e-48
coef std err z P>|z| [0.025 0.975]
Pclass 0.1182 0.076 1.554 0.120 -0.031 0.267
Age 0.0101 0.006 1.647 0.100 -0.002 0.022
SibSp -0.3548 0.110 -3.229 0.001 -0.570 -0.139
Parch -0.1766 0.119 -1.482 0.138 -0.410 0.057
Fare 0.0163 0.003 5.126 0.000 0.010 0.023
Sex_male -2.2946 0.202 -11.370 0.000 -2.690 -1.899

We will remove the variables whose p-value (P>|z| column) is greater than 0.05. If there is more than one such variable, we remove them from the model one at a time, starting with the one with the lowest absolute z value (z column), i.e. the highest p-value, and refit after each removal.

Then we check whether both ends of each confidence interval ([0.025, 0.975] columns) have the same sign. Variables whose interval crosses zero should also be removed from the model.
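This manual procedure can also be automated; a rough sketch of backward elimination by p-value, using the same X_train and y_train and the 0.05 cut-off described above (it may not reproduce the manual steps below exactly):

import statsmodels.api as sm

def backward_eliminate(X, y, threshold=0.05):
    cols = list(X.columns)
    while cols:
        model = sm.Logit(y, X[cols]).fit(disp=0)
        worst = model.pvalues.idxmax()           # highest p-value = lowest |z|
        if model.pvalues[worst] <= threshold:    # all remaining variables significant
            return cols, model
        cols.remove(worst)                       # drop one variable and refit
    raise ValueError("no significant variables left")

selected_cols, final_model = backward_eliminate(X_train, y_train)
print(selected_cols)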

In [38]:
# Dropping the insignificant variables (Parch and Pclass) identified above
X_train1 = X_train.drop(["Parch","Pclass"],axis=1)
                    
In [39]:
logit = sm.Logit(y_train, X_train1)
# fit the model
result = logit.fit()
result.summary()  # all p-values are now significant
                    
Optimization terminated successfully.
         Current function value: 0.508274
         Iterations 6
                    
Out[39]:
Logit Regression Results
Dep. Variable: Survived No. Observations: 712
Model: Logit Df Residuals: 708
Method: MLE Df Model: 3
Date: Mon, 28 Mar 2022 Pseudo R-squ.: 0.2416
Time: 12:55:37 Log-Likelihood: -361.89
converged: True LL-Null: -477.17
Covariance Type: nonrobust LLR p-value: 1.047e-49
coef std err z P>|z| [0.025 0.975]
Age 0.0147 0.005 3.044 0.002 0.005 0.024
SibSp -0.3412 0.096 -3.542 0.000 -0.530 -0.152
Fare 0.0142 0.003 4.972 0.000 0.009 0.020
Sex_male -2.1537 0.186 -11.569 0.000 -2.519 -1.789
In [40]:
data = pd.concat([y_train,X_train1],axis=1)
data.head()
                    
Out[40]:
Survived Age SibSp Fare Sex_male
57 0 28.500000 0 7.2292 1
717 1 27.000000 0 10.5000 0
431 1 27.915709 1 16.1000 0
633 0 30.726645 0 0.0000 1
163 0 17.000000 0 8.6625 1

Here we will check multicollinearity using VIF values.

The Logit results do not report VIF, and VIF depends only on the predictors, so we fit a linear (OLS) regression on the same variables to compute it.

In [41]:
### VIF Calculation
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Fit an OLS model on the same variables; the VIFs are computed from its design matrix
result = smf.ols(formula="Survived ~ Age + SibSp + Fare + Sex_male", data=data).fit()
result.summary()  # full regression summary

# Build a small table with the p-value, coefficient, and VIF of each variable
var = pd.DataFrame(round(result.pvalues, 3))   # p-values
var["coeff"] = result.params                   # coefficients
variables = result.model.exog                  # design matrix of the fitted model
vif = [variance_inflation_factor(variables, i) for i in range(variables.shape[1])]
var["vif"] = vif
var
# All VIF values are under 2, hence no variable is removed
                    
Out[41]:
0 coeff vif
Intercept 0.000 0.774683 8.804167
Age 0.073 -0.002176 1.091592
SibSp 0.000 -0.073560 1.092640
Fare 0.000 0.001684 1.073503
Sex_male 0.000 -0.522593 1.050660

Here we check whether any value in the vif column crosses the chosen threshold of 2 (other projects use more lenient cut-offs such as 5 or 10). The Intercept row should be ignored; the model would be biased without an intercept.

No variable has a VIF above 2, hence we keep all the variables.
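Since VIF depends only on the predictor matrix (each predictor is regressed on the others), it can also be computed directly, without fitting an OLS model on Survived; a minimal sketch on the same predictors:

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X_vif = add_constant(X_train1)   # add an intercept column, as in the formula model above
pd.Series([variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
          index=X_vif.columns)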

Now we will predict on the test data and check the accuracy.

In [42]:
y_predict_test1 = result.predict(X_test)
                    
In [43]:
y_predict_test1 = np.where(y_predict_test1 >= 0.5,1,0)
                    
In [44]:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_predict_test1,y_test))
                    
[[104  20]
 [ 13  42]]
                    
In [45]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict_test1))
                    
              precision    recall  f1-score   support

           0       0.84      0.89      0.86       117
           1       0.76      0.68      0.72        62

    accuracy                           0.82       179
   macro avg       0.80      0.78      0.79       179
weighted avg       0.81      0.82      0.81       179
                    
                    

Overall accuracy is 82%, and the precision for 0 and 1 is 84% and 76% respectively.

We can see that the statsmodels-based model has slightly outperformed the scikit-learn model in overall accuracy and precision on this test set.

Based on these results, we may prefer statsmodels over scikit-learn for binary logistic regression when we want this extra control over the model and a little more accuracy.

Sign-off note: We have only used logistic regression as our classification technique. Other classification techniques (ANN, XGBoost, KNN, SVM, etc.) could also be applied and their results compared to check for higher accuracy; a quick example is sketched below.
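For comparison, any other scikit-learn classifier can be dropped into the same fit/predict workflow; a minimal sketch with K-Nearest Neighbours (feature scaling would normally be added for KNN, and the result should be compared against the logistic regression above):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5)   # 5 neighbours is the sklearn default
knn.fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))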


Analytics Educator is a data science training institute based out of Kolkata. We specialize in providing training on data science even to students from a non-technical background with zero programming or statistical knowledge, and we help our associates learn data science and get a job in this field. You may check out all our instructor-led courses at this link: https://www.analyticseducator.com/Courses-Offers.html

Please send us your comments/feedback at analyticseducator@gmail.com, call us at 9163223228, or WhatsApp us at 9804919166.
