
Building a Logistic Regression in Python, Step by Step, Using Titanic Data

Logistic Regression

There is a separate regression model that handles the situation where the output variable is a binary or categorical variable rather than a continuous one. This model is called logistic regression. In other words, logistic regression is a variation of linear regression for a binary or categorical output variable. The two techniques are similar in that both model the output through a linear combination of the predictor variables. However, as we will see shortly, in logistic regression that linear combination must undergo a transformation before it can be read as a probability.
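As an illustration of that transformation, the linear combination of the predictors is passed through the sigmoid (logistic) function, which squashes any real number into the range 0 to 1 so it can be read as a probability. A minimal sketch in Python (the coefficient values below are made up purely for illustration):

import numpy as np

def sigmoid(z):
    # maps any real number into the (0, 1) interval
    return 1 / (1 + np.exp(-z))

# hypothetical model: intercept + coefficient * income
income = 55000                  # made-up customer income
z = -4.0 + 0.00008 * income     # linear combination = 0.4
print(sigmoid(z))               # ~0.60, read as P(customer buys the product)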

A few scenarios where logistic regression can be applied are as follows:

  • To predict whether a random customer will buy a particular product or not, given details such as income, gender, shopping history, and advertisement history
  • To predict whether a team will win or lose a match, given the match and team details such as weather, form of players, stadium, and hours spent in training

Note how the output variable in both cases is a binary or categorical variable.

PROBLEM STATEMENT

The sinking of the Titanic on April 15th, 1912 is one of the most infamous tragedies in history. The Titanic sank during her maiden voyage after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. The number of survivors was low partly because there were not enough lifeboats for everyone aboard. While there was some element of luck involved in surviving, some groups of people, such as women, children, and the upper class, seem to have been more likely to survive than others. This case study analyzes what sorts of people were likely to survive this tragedy. The dataset includes the following:

  • Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
  • Sex: Sex of the passenger
  • Age: Age in years
  • Sibsp: # of siblings / spouses aboard the Titanic
  • Parch: # of parents / children aboard the Titanic
  • Ticket: Ticket number
  • Fare: Passenger fare
  • Cabin: Cabin number
  • Embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
  • Survived (target variable): Survival (0 = No, 1 = Yes)

DATA SOURCE: https://www.kaggle.com/c/titanic

Methodology

We will use a machine learning algorithm, binary logistic regression, in Python to predict whether a person survived or not (denoted by 1 and 0). Logistic regression is a classification algorithm suited to a binary dependent variable; its output is the probability of the outcome being 1.

We have two datasets, train and test. We will build the model on the train data and then test the model on the test data to check the accuracy.

First we will do exploratory data analysis, followed by feature engineering and outlier and missing value treatment.

Then we will run logistic regression using both statsmodels and scikit-learn (sklearn) and compare their accuracy, to judge which provides the better output.

IMPORT LIBRARIES

We are importing all the required libraries

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
                    

IMPORT DATASET

In [2]:
# read the data using pandas dataframe
import os
os.chdir("D:\\Python case study 4")
train = pd.read_csv('Train_Titanic.csv')
                    
In [3]:
# Show the data head!
train.head()
                    
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

The data has been properly imported.

Observation: We can't predict anything from the PassengerId, Name, and Ticket columns, hence we will drop them. Here, Survived is our dependent variable. The port where a person boarded the ship (Embarked column) shouldn't predict their chance of survival, hence it will also be dropped.

In [5]:
train.drop(["PassengerId","Name","Ticket","Embarked"],axis=1,inplace=True)
train.head(2)
                    
Out[5]:
Survived Pclass Sex Age SibSp Parch Fare Cabin
0 0 3 male 22.0 1 0 7.2500 NaN
1 1 1 female 38.0 1 0 71.2833 C85

EXPLORE/VISUALIZE DATASET

In [6]:
# Let's count the number of survivors and non-survivors
train['Survived'].value_counts()
                    
Out[6]:
0    549
1    342
Name: Survived, dtype: int64

Observation: Here 1 means survived and 0 means died. The data is reasonably balanced, since the numbers of 1s and 0s are fairly close.
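The same counts can be viewed as proportions (a small optional check, not part of the original notebook); roughly 62% did not survive and 38% survived:

# share of each outcome in the training data
train['Survived'].value_counts(normalize=True)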

In [7]:
#Number of people travelling by different class
plt.figure(figsize=[10,5])
sns.countplot(x = 'Pclass', data = train)
                    
Out[7]:
<AxesSubplot:xlabel='Pclass', ylabel='count'>

Observation: Most people were travelling in 3rd class.

In [8]:
plt.figure(figsize=[10,5])
sns.countplot(x = 'Pclass', hue = 'Survived', data=train)
                    
Out[8]:
<AxesSubplot:xlabel='Pclass', ylabel='count'>

Observation: More people travelling in 1st class survived; it is almost even for 2nd class (marginally more died than survived), and most of the people travelling in 3rd class died. This chart shows that class had some impact on whether a person would survive or not.

Now the same analysis will be made for the other independent variables.

In [9]:
plt.figure(figsize=[10,5])
sns.countplot(x = 'SibSp', hue = 'Survived', data=train)
                    
Out[9]:
<AxesSubplot:xlabel='SibSp', ylabel='count'>

Observation: Bar chart showing the number of people who survived, by number of siblings/spouses aboard (SibSp). If you have 1 sibling or spouse aboard (SibSp = 1), you have a higher chance of survival than if you are alone (SibSp = 0).

In [10]:
plt.figure(figsize=[10,5])
sns.countplot(x = 'Parch', hue = 'Survived', data=train)
                    
Out[10]:
<AxesSubplot:xlabel='Parch', ylabel='count'>

Observation: Bar chart showing the number of people who survived, by number of parents/children aboard (Parch). If you have 1, 2, or 3 such family members aboard (Parch = 1, 2, or 3), you have a higher chance of survival than if you are alone (Parch = 0).

In [11]:
plt.figure(figsize=[10,5])
sns.countplot(x = 'Sex', hue = 'Survived', data=train)
                    
Out[11]:
<AxesSubplot:xlabel='Sex', ylabel='count'>

Observation: Bar chart showing the number of people who survived, by sex. If you are a female, you have a much higher chance of survival than a male.

Women and children were given priority during the evacuation, hence it makes sense that sex will help to predict the chances of survival.
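This can also be checked numerically; an optional short sketch computing the survival rate within each sex (the mean of a 0/1 column is the proportion of 1s):

# survival rate per sex; females survive at a much higher rate than males
train.groupby('Sex')['Survived'].mean()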

In [12]:
plt.figure(figsize=(20,5))
sns.countplot(x = 'Age', hue = 'Survived', data=train)
                    
Out[12]:
<AxesSubplot:xlabel='Age', ylabel='count'>

Observation: Bar chart showing the number of people who survived, by age. If you are a baby, you have a higher chance of survival.

Women and children were given priority during the evacuation, hence it makes sense that age will help to predict the chances of survival.

In [13]:
# Age Histogram 
train['Age'].hist(bins = 40)
                    
Out[13]:
<AxesSubplot:>

Observation: The histogram shows that most passengers were between roughly 20 and 30 years of age.

In [14]:
plt.figure(figsize=(80,40))
sns.countplot(x = 'Fare', hue = 'Survived', data=train)
                    
Out[14]:
<AxesSubplot:xlabel='Fare', ylabel='count'>

Observation: Bar chart showing the number of people who survived, by fare. If you pay a higher fare, you have a higher chance of survival.

In [15]:
# Fare Histogram 
train['Fare'].hist(bins = 40)
                    
Out[15]:
<AxesSubplot:>

Observation: Most people paid a low fare; only a handful paid a high fare.

PREPARE THE DATA FOR TRAINING / DATA CLEANING

In [16]:
# number of missing values by variables
train.isnull().sum()
                    
Out[16]:
Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
dtype: int64

Observation: There are missing values for Age and Cabin

In [17]:
# percentage of missing values by variables
train.isnull().mean()*100
                    
Out[17]:
Survived     0.000000
Pclass       0.000000
Sex          0.000000
Age         19.865320
SibSp        0.000000
Parch        0.000000
Fare         0.000000
Cabin       77.104377
dtype: float64

Observation: The missing values are shown as percentages: 77% of the values in Cabin are missing, and 19.8% of the values in Age are missing.

In [18]:
# Let's visualize which variables in the dataset are missing
sns.heatmap(train.isnull(), yticklabels = False, cbar = False, cmap="Blues")
                    
Out[18]:
<AxesSubplot:>

Observation: Since a very high percentage of the values in Cabin are missing, this variable is not going to help the model. We will drop it from the dataset.

In [19]:
# Dropping the Cabin column
train.drop('Cabin',axis=1,inplace=True)
                    
In [20]:
train.head()
                    
Out[20]:
Survived Pclass Sex Age SibSp Parch Fare
0 0 3 male 22.0 1 0 7.2500
1 1 1 female 38.0 1 0 71.2833
2 1 3 female 26.0 0 0 7.9250
3 1 1 female 35.0 1 0 53.1000
4 0 3 male 35.0 0 0 8.0500
In [21]:
# Let's view the missing values in the data one more time!
sns.heatmap(train.isnull(), yticklabels = False, cbar = False, cmap="Blues")
                    
Out[21]:
<AxesSubplot:>

Observation: About 19.8% of the Age values are missing. We can't drop the column entirely, nor can we leave the values missing, so we will impute them. However, simply replacing every missing value with the overall average age would be misleading.

The overall mean of Age is:

In [48]:
#Mean of total Age
train.Age.mean()
                    
Out[48]:
29.736034227171306

Observation: If we check the average age separately for males and females, we can see that the values differ.

In [23]:
# Boxplot of Age by Sex: the typical age of males and females differs
plt.figure(figsize=(15, 10))
sns.boxplot(x='Sex', y='Age',data=train)
                    
Out[23]:
<AxesSubplot:xlabel='Sex', ylabel='Age'>

Hence, we should replace the missing Age values for females with the average female age and the missing Age values for males with the average male age.

In [24]:
#Shows the missing values of Age for male
train.loc[(train["Age"].isnull()) & (train["Sex"] == "male"),"Age"].head()
                    
Out[24]:
5    NaN
17   NaN
26   NaN
29   NaN
36   NaN
Name: Age, dtype: float64
In [25]:
#Shows the average age for male
train.loc[train["Sex"] == "male","Age"].mean()
                    
Out[25]:
30.72664459161148
In [26]:
#Replace missing age for male and female with average age of male and female respectively
train.loc[(train["Age"].isnull()) & (train["Sex"] == "male"),"Age"] = train.loc[train["Sex"] == "male","Age"].mean()
train.loc[(train["Age"].isnull()) & (train["Sex"] == "female"),"Age"] = train.loc[train["Sex"] == "female","Age"].mean()
                    
In [27]:
#Check again for missing values 
sns.heatmap(train.isnull(), yticklabels = False, cbar = False, cmap="Blues")
# now there are no missing values
                    
Out[27]:
<AxesSubplot:>

Now there are no missing values in the dataset. We have completed the data cleaning.
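As a side note, the same sex-wise imputation can be written more compactly with a single groupby transform; a minimal sketch equivalent in effect to the two .loc assignments above:

# fill each missing Age with the mean Age of that passenger's sex
train['Age'] = train['Age'].fillna(train.groupby('Sex')['Age'].transform('mean'))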

Create Dummy variables

We need to create dummy variables for all the categorical variables.

In [28]:
train.head()
                    
Out[28]:
Survived Pclass Sex Age SibSp Parch Fare
0 0 3 male 22.0 1 0 7.2500
1 1 1 female 38.0 1 0 71.2833
2 1 3 female 26.0 0 0 7.9250
3 1 1 female 35.0 1 0 53.1000
4 0 3 male 35.0 0 0 8.0500

Observation: We can see that we have only one categorical variable here, i.e. Sex. We will create the dummy variable as shown below:

In [29]:
train = pd.get_dummies(data=train, columns=['Sex'],drop_first=True)
                    
In [30]:
train.head()
                    
Out[30]:
Survived Pclass Age SibSp Parch Fare Sex_male
0 0 3 22.0 1 0 7.2500 1
1 1 1 38.0 1 0 71.2833 0
2 1 3 26.0 0 0 7.9250 0
3 1 1 35.0 1 0 53.1000 0
4 0 3 35.0 0 0 8.0500 1

Data split

We will now split the data into the dependent variable (y) and the independent variables (X).

In [31]:
# Let's drop the target column before we do the train/test split
X = train.drop('Survived',axis=1)
y = train['Survived']
                    

Now we will split the data into a training set (80% of the data) and a test set (the remaining 20%), which will be kept aside for later use.

In [32]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
                    

MODEL TRAINING using Scikit-Learn

First we will fit the multiple logistic regression model using the scikit-learn package, one of the most popular Python packages for this task. In this workflow we do not manually check p-values, confidence intervals, or multicollinearity; the model is simply fitted on all the predictors.

In [33]:
# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
                    
Out[33]:
LogisticRegression(random_state=0)

MODEL TESTING

Once the model is trained, we will generate predictions for the test data.

In [34]:
y_predict_test = classifier.predict(X_test)
                    

Now we will check the confusion matrix and accuracy of the model

In [35]:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_predict_test,y_test))
                    
[[100  17]
 [ 17  45]]
                    
In [36]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict_test))
                    
              precision    recall  f1-score   support

           0       0.85      0.85      0.85       117
           1       0.73      0.73      0.73        62

    accuracy                           0.81       179
   macro avg       0.79      0.79      0.79       179
weighted avg       0.81      0.81      0.81       179
                    
                    

Overall accuracy is 81%, and the precision for 0 and 1 is 85% and 73% respectively.
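These figures follow directly from the confusion matrix above; a quick check of the arithmetic:

# correct predictions (diagonal) divided by all predictions
(100 + 45) / (100 + 17 + 17 + 45)   # = 145 / 179 ≈ 0.81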

MODEL TRAINING using statsmodels

statsmodels is a very useful package for algorithms like logistic regression. It lets us examine the relationship of each independent variable with the dependent variable.

It also provides various model validation statistics to check and fine-tune the model.

In [37]:
#Import package and run the model
import statsmodels.api as sm
logit = sm.Logit(y_train, X_train)
# fit the model
result = logit.fit()
result.summary()  # check which p-values are significant
                    
Optimization terminated successfully.
         Current function value: 0.505715
         Iterations 6
                    
Out[37]:
Logit Regression Results
Dep. Variable: Survived No. Observations: 712
Model: Logit Df Residuals: 706
Method: MLE Df Model: 5
Date: Mon, 28 Mar 2022 Pseudo R-squ.: 0.2454
Time: 12:55:37 Log-Likelihood: -360.07
converged: True LL-Null: -477.17
Covariance Type: nonrobust LLR p-value: 1.343e-48
coef std err z P>|z| [0.025 0.975]
Pclass 0.1182 0.076 1.554 0.120 -0.031 0.267
Age 0.0101 0.006 1.647 0.100 -0.002 0.022
SibSp -0.3548 0.110 -3.229 0.001 -0.570 -0.139
Parch -0.1766 0.119 -1.482 0.138 -0.410 0.057
Fare 0.0163 0.003 5.126 0.000 0.010 0.023
Sex_male -2.2946 0.202 -11.370 0.000 -2.690 -1.899

We will remove the variables whose p-value (P>|z| column) is greater than 0.05. If there is more than one such variable, we remove them from the model one at a time, starting with the one with the lowest absolute z value (z column), i.e. the highest p-value, and refit after each removal.

Then we check whether both ends of each confidence interval ([0.025, 0.975] columns) have the same sign. Variables whose interval crosses zero should also be removed from the model.
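This manual procedure can also be automated; a rough sketch of backward elimination by p-value, using the same X_train and y_train and the 0.05 cut-off described above (it may not reproduce the manual steps below exactly):

import statsmodels.api as sm

def backward_eliminate(X, y, threshold=0.05):
    cols = list(X.columns)
    while cols:
        model = sm.Logit(y, X[cols]).fit(disp=0)
        worst = model.pvalues.idxmax()           # highest p-value = lowest |z|
        if model.pvalues[worst] <= threshold:    # all remaining variables significant
            return cols, model
        cols.remove(worst)                       # drop one variable and refit
    raise ValueError("no significant variables left")

selected_cols, final_model = backward_eliminate(X_train, y_train)
print(selected_cols)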

In [38]:
# Dropping the insignificant variables (Parch and Pclass) identified above
X_train1 = X_train.drop(["Parch","Pclass"],axis=1)
                    
In [39]:
logit = sm.Logit(y_train, X_train1)
# fit the model
result = logit.fit()
result.summary()  # all p-values are now significant
                    
Optimization terminated successfully.
         Current function value: 0.508274
         Iterations 6
                    
Out[39]:
Logit Regression Results
Dep. Variable: Survived No. Observations: 712
Model: Logit Df Residuals: 708
Method: MLE Df Model: 3
Date: Mon, 28 Mar 2022 Pseudo R-squ.: 0.2416
Time: 12:55:37 Log-Likelihood: -361.89
converged: True LL-Null: -477.17
Covariance Type: nonrobust LLR p-value: 1.047e-49
coef std err z P>|z| [0.025 0.975]
Age 0.0147 0.005 3.044 0.002 0.005 0.024
SibSp -0.3412 0.096 -3.542 0.000 -0.530 -0.152
Fare 0.0142 0.003 4.972 0.000 0.009 0.020
Sex_male -2.1537 0.186 -11.569 0.000 -2.519 -1.789
In [40]:
data = pd.concat([y_train,X_train1],axis=1)
data.head()
                    
Out[40]:
Survived Age SibSp Fare Sex_male
57 0 28.500000 0 7.2292 1
717 1 27.000000 0 10.5000 0
431 1 27.915709 1 16.1000 0
633 0 30.726645 0 0.0000 1
163 0 17.000000 0 8.6625 1

Here we will check multicollinearity using VIF values.

The Logit results do not report VIF, and VIF depends only on the predictors, so we fit a linear (OLS) regression on the same variables to compute it.

In [41]:
### VIF Calculation
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Fit an OLS model on the same variables; the VIFs are computed from its design matrix
result = smf.ols(formula="Survived ~ Age + SibSp + Fare + Sex_male", data=data).fit()
result.summary()  # full regression summary

# Build a small table with the p-value, coefficient, and VIF of each variable
var = pd.DataFrame(round(result.pvalues, 3))   # p-values
var["coeff"] = result.params                   # coefficients
variables = result.model.exog                  # design matrix of the fitted model
vif = [variance_inflation_factor(variables, i) for i in range(variables.shape[1])]
var["vif"] = vif
var
# All VIF values are under 2, hence no variable is removed
                    
Out[41]:
0 coeff vif
Intercept 0.000 0.774683 8.804167
Age 0.073 -0.002176 1.091592
SibSp 0.000 -0.073560 1.092640
Fare 0.000 0.001684 1.073503
Sex_male 0.000 -0.522593 1.050660

Here we check whether any value in the vif column crosses the chosen threshold of 2 (other projects use more lenient cut-offs such as 5 or 10). The Intercept row should be ignored; the model would be biased without an intercept.

No variable has a VIF above 2, hence we keep all the variables.
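Since VIF depends only on the predictor matrix (each predictor is regressed on the others), it can also be computed directly, without fitting an OLS model on Survived; a minimal sketch on the same predictors:

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X_vif = add_constant(X_train1)   # add an intercept column, as in the formula model above
pd.Series([variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
          index=X_vif.columns)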

Now we will predict on the test data and check the accuracy.

In [42]:
y_predict_test1 = result.predict(X_test)
                    
In [43]:
y_predict_test1 = np.where(y_predict_test1 >= 0.5,1,0)
                    
In [44]:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_predict_test1,y_test))
                    
[[104  20]
 [ 13  42]]
                    
In [45]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict_test1))
                    
              precision    recall  f1-score   support

           0       0.84      0.89      0.86       117
           1       0.76      0.68      0.72        62

    accuracy                           0.82       179
   macro avg       0.80      0.78      0.79       179
weighted avg       0.81      0.82      0.81       179
                    
                    

Overall accuracy is 82%, and the precision for 0 and 1 is 84% and 76% respectively.

We can see that the statsmodels-based model has slightly outperformed the scikit-learn model in overall accuracy and precision on this test set.

Based on these results, we may prefer statsmodels over scikit-learn for binary logistic regression when we want this extra control over the model and a little more accuracy.

Sign-off note: We have only used logistic regression as our classification technique. Other classification techniques (ANN, XGBoost, KNN, SVM, etc.) could also be applied and their results compared to check for higher accuracy; a quick example is sketched below.
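For comparison, any other scikit-learn classifier can be dropped into the same fit/predict workflow; a minimal sketch with K-Nearest Neighbours (feature scaling would normally be added for KNN, and the result should be compared against the logistic regression above):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5)   # 5 neighbours is the sklearn default
knn.fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))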


Analytics Educator is a data science training institute based out of Kolkata. We specialize in providing training on data science even to students from a non-technical background with zero programming or statistical knowledge, and we help our associates learn data science and get a job in this field. You may check out all our instructor-led courses at this link: https://www.analyticseducator.com/Courses-Offers.html

Please send us your comments/feedback at analyticseducator@gmail.com, call us at 9163223228, or WhatsApp us at 9804919166.
