As it happens, there is a separate regression model that handles situations where the output variable is binary or categorical rather than continuous. This model is called logistic regression. In other words, logistic regression is a variation of linear regression for the case where the output variable is binary or categorical. The two regressions are similar in that both build a linear combination of the predictor variables; however, as we will see soon, in logistic regression that linear combination has to undergo a transformation so that it can be read as a probability for the output variable.
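Concretely, logistic regression passes the linear combination of the predictors through the logistic (sigmoid) function, so that the model's output can be read as a probability between 0 and 1. A minimal sketch of that transformation (the coefficient and predictor values here are made up purely for illustration):

import numpy as np

def sigmoid(z):
    # logistic function: squashes any real number into the (0, 1) interval
    return 1 / (1 + np.exp(-z))

# hypothetical linear combination b0 + b1*x for one predictor value
b0, b1, x = -1.5, 0.8, 2.0
z = b0 + b1 * x
probability = sigmoid(z)   # ~0.52, read as P(y = 1 | x)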
A few scenarios where logistic regression can be applied are as follows:
- To predict whether a random customer will buy a particular product, given details such as income, gender, shopping history, and advertisement history
- To predict whether a team will win or lose a match, given match and team details such as weather, form of the players, stadium, and hours spent in training

Note how the output variable in both cases is a binary or categorical variable.
The sinking of the Titanic on April 15, 1912 is one of the most infamous tragedies in history. The Titanic sank during her maiden voyage after colliding with an iceberg, killing 1,502 of the 2,224 passengers and crew on board. The number of survivors was low partly because there were not enough lifeboats for everyone aboard. While there was some element of luck involved in surviving, some groups of people, such as women, children, and the upper class, appear to have been more likely to survive than others. This case study analyzes what sorts of people were likely to survive the tragedy, using the passenger-level dataset loaded below.
We start by importing all the required libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# read the data using pandas dataframe
import os
os.chdir("D:\\Python case study 4")
train = pd.read_csv('Train_Titanic.csv')
# Show the data head!
train.head()
  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
The data has been properly imported.
train.drop(["PassengerId","Name","Ticket","Embarked"],axis=1,inplace=True)
train.head(2)
  | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin
---|---|---|---|---|---|---|---|---
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | NaN |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 |
# EXPLORE/VISUALIZE DATASET
# Let's count the number of survivors and non-survivors
train['Survived'].value_counts()
0    549
1    342
Name: Survived, dtype: int64
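To see the same class balance as proportions rather than raw counts, one could also pass normalize=True (a quick extra check, not part of the original notebook):

# proportion of non-survivors (0) and survivors (1)
train['Survived'].value_counts(normalize=True)   # roughly 0.62 for class 0 and 0.38 for class 1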
# Number of people travelling in each passenger class
plt.figure(figsize=[10,5])
sns.countplot(x = 'Pclass', data = train)
<AxesSubplot:xlabel='Pclass', ylabel='count'>
Observation: here 1 means survived and 0 means died. The data is fairly balanced, since the counts of 1s and 0s are of the same order of magnitude.
plt.figure(figsize=[10,5])
sns.countplot(x = 'Pclass', hue = 'Survived', data=train)
<AxesSubplot:xlabel='Pclass', ylabel='count'>
plt.figure(figsize=[10,5])
sns.countplot(x = 'SibSp', hue = 'Survived', data=train)
<AxesSubplot:xlabel='SibSp', ylabel='count'>
plt.figure(figsize=[10,5])
sns.countplot(x = 'Parch', hue = 'Survived', data=train)
<AxesSubplot:xlabel='Parch', ylabel='count'>
plt.figure(figsize=[10,5])
sns.countplot(x = 'Sex', hue = 'Survived', data=train)
<AxesSubplot:xlabel='Sex', ylabel='count'>
plt.figure(figsize=(20,5))
sns.countplot(x = 'Age', hue = 'Survived', data=train)
<AxesSubplot:xlabel='Age', ylabel='count'>
# Age Histogram
train['Age'].hist(bins = 40)
<AxesSubplot:>
plt.figure(figsize=(80,40))
sns.countplot(x = 'Fare', hue = 'Survived', data=train)
<AxesSubplot:xlabel='Fare', ylabel='count'>
# Fare Histogram
train['Fare'].hist(bins = 40)
<AxesSubplot:>
# number of missing values by variables
train.isnull().sum()
Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
dtype: int64
# percentage of missing values by variables
train.isnull().mean()*100
Survived     0.000000
Pclass       0.000000
Sex          0.000000
Age         19.865320
SibSp        0.000000
Parch        0.000000
Fare         0.000000
Cabin       77.104377
dtype: float64
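Cabin is missing for roughly 77% of the rows, so it is simply dropped below. A more general pattern for this kind of decision (a sketch, assuming a 50% missingness cutoff) would be:

# find the columns whose fraction of missing values exceeds the chosen threshold
threshold = 0.5
cols_to_drop = train.columns[train.isnull().mean() > threshold]   # here this contains only 'Cabin'
# train = train.drop(columns=cols_to_drop)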
# Let's visualize which variables in the dataset are missing
sns.heatmap(train.isnull(), yticklabels = False, cbar = False, cmap="Blues")
<AxesSubplot:>
# Dropping the Cabin column, since about 77% of its values are missing
train.drop('Cabin',axis=1,inplace=True)
train.head()
  | Survived | Pclass | Sex | Age | SibSp | Parch | Fare
---|---|---|---|---|---|---|---
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 |
# Let's view the missing values in the data one more time!
sns.heatmap(train.isnull(), yticklabels = False, cbar = False, cmap="Blues")
<AxesSubplot:>
#Mean of total Age
train.Age.mean()
29.736034227171306
# Let's compare the age distribution by sex (the mean age is ~31 for males and ~28 for females)
plt.figure(figsize=(15, 10))
sns.boxplot(x='Sex', y='Age',data=train)
<AxesSubplot:xlabel='Sex', ylabel='Age'>
#Shows the missing values of Age for male
train.loc[(train["Age"].isnull()) & (train["Sex"] == "male"),"Age"].head()
5     NaN
17    NaN
26    NaN
29    NaN
36    NaN
Name: Age, dtype: float64
#Shows the average age for male
train.loc[train["Sex"] == "male","Age"].mean()
30.72664459161148
#Replace missing age for male and female with average age of male and female respectively
train.loc[(train["Age"].isnull()) & (train["Sex"] == "male"),"Age"] = train.loc[train["Sex"] == "male","Age"].mean()
train.loc[(train["Age"].isnull()) & (train["Sex"] == "female"),"Age"] = train.loc[train["Sex"] == "female","Age"].mean()
#Check again for missing values
sns.heatmap(train.isnull(), yticklabels = False, cbar = False, cmap="Blues")
# now there are no missing values
<AxesSubplot:>
train.head()
  | Survived | Pclass | Sex | Age | SibSp | Parch | Fare
---|---|---|---|---|---|---|---
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 |
# Encode Sex as a numeric dummy variable (Sex_male = 1 for male, 0 for female); drop_first avoids a redundant column
train = pd.get_dummies(data=train, columns=['Sex'],drop_first=True)
train.head()
  | Survived | Pclass | Age | SibSp | Parch | Fare | Sex_male
---|---|---|---|---|---|---|---
0 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 1 |
1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 |
2 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 | 0 |
3 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 0 |
4 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 | 1 |
# Separate the predictors (X) from the target column (y) before the train/test split
X = train.drop('Survived',axis=1)
y = train['Survived']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
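Since the classes are somewhat imbalanced (549 non-survivors versus 342 survivors), one could also stratify the split so that both sets keep the same survival ratio. A possible variant, shown only for reference (the results below use the unstratified split above):

# stratified variant: preserves the 0/1 proportion in both the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10, stratify=y)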
# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
LogisticRegression(random_state=0)
y_predict_test = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_predict_test,y_test))
[[100  17]
 [ 17  45]]
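The diagonal of the confusion matrix holds the correctly classified passengers, so the overall accuracy can be read straight off it and matches the classification report below:

# accuracy = correctly classified / total test passengers
cm = confusion_matrix(y_predict_test,y_test)
accuracy = cm.trace() / cm.sum()   # (100 + 45) / 179 ≈ 0.81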
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict_test))
              precision    recall  f1-score   support

           0       0.85      0.85      0.85       117
           1       0.73      0.73      0.73        62

    accuracy                           0.81       179
   macro avg       0.79      0.79      0.79       179
weighted avg       0.81      0.81      0.81       179
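The report above depends on the default 0.5 probability cutoff. A threshold-free check such as ROC AUC can be computed from the predicted probabilities (a sketch using scikit-learn's standard API; this value is not reported in the original run):

from sklearn.metrics import roc_auc_score
# predicted probability of survival (class 1) for each test passenger
y_prob_test = classifier.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_prob_test))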
# Import statsmodels and fit a logistic regression to examine the coefficients and p-values
import statsmodels.api as sm
logit = sm.Logit(y_train, X_train)
# fit the model
result = logit.fit()
result.summary()  # check which variables have significant p-values
Optimization terminated successfully.
         Current function value: 0.505715
         Iterations 6
Dep. Variable: | Survived | No. Observations: | 712 |
---|---|---|---|
Model: | Logit | Df Residuals: | 706 |
Method: | MLE | Df Model: | 5 |
Date: | Mon, 28 Mar 2022 | Pseudo R-squ.: | 0.2454 |
Time: | 12:55:37 | Log-Likelihood: | -360.07 |
converged: | True | LL-Null: | -477.17 |
Covariance Type: | nonrobust | LLR p-value: | 1.343e-48 |
  | coef | std err | z | P>|z| | [0.025 | 0.975]
---|---|---|---|---|---|---
Pclass | 0.1182 | 0.076 | 1.554 | 0.120 | -0.031 | 0.267 |
Age | 0.0101 | 0.006 | 1.647 | 0.100 | -0.002 | 0.022 |
SibSp | -0.3548 | 0.110 | -3.229 | 0.001 | -0.570 | -0.139 |
Parch | -0.1766 | 0.119 | -1.482 | 0.138 | -0.410 | 0.057 |
Fare | 0.0163 | 0.003 | 5.126 | 0.000 | 0.010 | 0.023 |
Sex_male | -2.2946 | 0.202 | -11.370 | 0.000 | -2.690 | -1.899 |
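Note that, unlike the scikit-learn model above, sm.Logit does not add an intercept automatically, which is why the summary contains no constant term. If an intercept were wanted, the usual pattern (not what was run here; the variable names are illustrative) would be:

# add an explicit intercept column before fitting
X_train_const = sm.add_constant(X_train)
result_const = sm.Logit(y_train, X_train_const).fit()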
# Remove the insignificant variables (Parch and Pclass); in practice they are removed one at a time, refitting after each removal
X_train1 = X_train.drop(["Parch","Pclass"],axis=1)
logit = sm.Logit(y_train, X_train1)
# fit the model
result = logit.fit()
result.summary()#all p values are significant
Optimization terminated successfully.
         Current function value: 0.508274
         Iterations 6
Dep. Variable: | Survived | No. Observations: | 712 |
---|---|---|---|
Model: | Logit | Df Residuals: | 708 |
Method: | MLE | Df Model: | 3 |
Date: | Mon, 28 Mar 2022 | Pseudo R-squ.: | 0.2416 |
Time: | 12:55:37 | Log-Likelihood: | -361.89 |
converged: | True | LL-Null: | -477.17 |
Covariance Type: | nonrobust | LLR p-value: | 1.047e-49 |
  | coef | std err | z | P>|z| | [0.025 | 0.975]
---|---|---|---|---|---|---
Age | 0.0147 | 0.005 | 3.044 | 0.002 | 0.005 | 0.024 |
SibSp | -0.3412 | 0.096 | -3.542 | 0.000 | -0.530 | -0.152 |
Fare | 0.0142 | 0.003 | 4.972 | 0.000 | 0.009 | 0.020 |
Sex_male | -2.1537 | 0.186 | -11.569 | 0.000 | -2.519 | -1.789 |
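The logit coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to interpret. For example, exp(-2.1537) ≈ 0.12 for Sex_male means that, all else being equal, the odds of survival for a male are roughly 12% of those for a female. A quick way to get all of them:

# convert the log-odds coefficients to odds ratios
odds_ratios = np.exp(result.params)
print(odds_ratios)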
data = pd.concat([y_train,X_train1],axis=1)
data.head()
  | Survived | Age | SibSp | Fare | Sex_male
---|---|---|---|---|---
57 | 0 | 28.500000 | 0 | 7.2292 | 1 |
717 | 1 | 27.000000 | 0 | 10.5000 | 0 |
431 | 1 | 27.915709 | 1 | 16.1000 | 0 |
633 | 0 | 30.726645 | 0 | 0.0000 | 1 |
163 | 0 | 17.000000 | 0 | 8.6625 | 1 |
### VIF Calculation
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm   # note: this rebinds the name sm to the formula API
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Fit an OLS model on the same predictors; its design matrix is used for the VIF calculation
result = sm.ols(formula="Survived~Age+SibSp+Fare+Sex_male",
                data=data).fit()
result.summary()  # show the full OLS summary
# Remove variables based on VIF; here all predictor VIFs are under 2 (the intercept's large VIF is ignored), so no variable is removed
var = pd.DataFrame(round(result.pvalues,3))   # p-values
var["coeff"] = result.params                  # coefficients
variables = result.model.exog                 # the design matrix used to fit the model
# (if the results object had been saved as, say, rock, this would be rock.model.exog)
vif = [variance_inflation_factor(variables, i) for i in range(variables.shape[1])]
vif
var["vif"] = vif
var
  | p-value | coeff | vif
---|---|---|---
Intercept | 0.000 | 0.774683 | 8.804167 |
Age | 0.073 | -0.002176 | 1.091592 |
SibSp | 0.000 | -0.073560 | 1.092640 |
Fare | 0.000 | 0.001684 | 1.073503 |
Sex_male | 0.000 | -0.522593 | 1.050660 |
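As a reminder of where these numbers come from, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors; the large VIF of the intercept is expected and is normally ignored. A manual check for one variable (a sketch, equivalent to what variance_inflation_factor computes):

# VIF for Age, computed by hand: regress Age on the remaining predictors
aux = sm.ols(formula="Age ~ SibSp + Fare + Sex_male", data=data).fit()
vif_age = 1 / (1 - aux.rsquared)   # should be close to the ~1.09 reported above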
# Predict on the test set and convert the fitted values to 0/1 classes using a 0.5 cutoff
y_predict_test1 = result.predict(X_test)
y_predict_test1 = np.where(y_predict_test1 >= 0.5,1,0)
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_predict_test1,y_test))
[[104  20]
 [ 13  42]]
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict_test1))
              precision    recall  f1-score   support

           0       0.84      0.89      0.86       117
           1       0.76      0.68      0.72        62

    accuracy                           0.82       179
   macro avg       0.80      0.78      0.79       179
weighted avg       0.81      0.82      0.81       179