Data science has risen to prominence over the last decade because of its predictive capabilities. Many business verticals benefit from predictive algorithms, but insurance companies place particular importance on them because they help keep premiums low. Data has always been at the core of what insurance companies do: they analyze claims, the kind of vehicle a person drives, how many miles they drive per day, and more.
The field keeps gaining strength as technology improves and statistical libraries for regression and classification become widely available. Actuaries, as the data scientists at insurance companies were called a decade ago, used to collate data from different sources and analyze premium and claim data to identify fraudulent transactions, which helped keep premiums low. If anything, today's data science technology has given them far more tools for that analysis.
The data contains a few ordinal and categorical features that need to be parsed and encoded properly.
Our goal is to predict a binary outcome: 1 indicates a safe driver and 0 indicates that the driver's data needs a review. We will also look at the continuous variables and fill in missing values with the mean or median so as not to skew our results.
After cleaning the data and filling in missing values, we will look at the features and their correlations so that we can drop highly correlated features, which may distort our results.
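The correlation-based pruning mentioned above could look roughly like the sketch below. The helper name drop_highly_correlated, the 0.9 threshold, and the toy columns are all assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop one column from each pair whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so that each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy data (hypothetical): "a_copy" is nearly collinear with "a"
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + 0.01 * rng.normal(size=200),
    "b": rng.normal(size=200),
})
reduced, dropped = drop_highly_correlated(demo, threshold=0.9)
print(dropped)
```

The threshold is a judgment call; a lower value removes more features at the risk of discarding useful signal.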
# Import the necessary packages of Python that we will/may use in this notebook
import pandas as pd
import numpy as np
import warnings
from sklearn.preprocessing import LabelBinarizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import roc_curve, precision_recall_curve, auc, make_scorer, recall_score, accuracy_score, precision_score, confusion_matrix, fbeta_score, roc_auc_score
from sklearn.metrics import classification_report as cr
import matplotlib.pyplot as plt
plt.style.use("ggplot")
import os
# Read the data from the local drive
safe_driver = pd.read_csv('safe_driver.csv')
ID | target | Gender | EngineHP | credit_history | Years_Experience | annual_claims | Marital_Status | Vehical_type | Miles_driven_annually | size_of_family | Age_bucket | EngineHP_bucket | Years_Experience_bucket | Miles_driven_annually_bucket | credit_history_bucket | State | |
0 | 1 | 1 | F | 522 | 656 | 1 | 0 | Married | Car | 14749.0 | 5 | <18 | >350 | <3 | <15k | Fair | IL |
1 | 2 | 1 | F | 691 | 704 | 16 | 0 | Married | Car | 15389.0 | 6 | 28-34 | >350 | 15-30 | 15k-25k | Good | NJ |
2 | 3 | 1 | M | 133 | 691 | 15 | 0 | Married | Van | 9956.0 | 3 | >40 | 90-160 | 15-30 | <15k | Good | CT |
3 | 4 | 1 | M | 146 | 720 | 9 | 0 | Married | Van | 77323.0 | 3 | 18-27 | 90-160 | 9-14' | >25k | Good | CT |
4 | 5 | 1 | M | 128 | 771 | 33 | 1 | Married | Van | 14183.0 | 4 | >40 | 90-160 | >30 | <15k | Very Good | WY |
At first glance, the first few rows show no obvious problems with the data.
ID | target | EngineHP | credit_history | Years_Experience | annual_claims | Miles_driven_annually | size_of_family | |
count | 30240.000000 | 30240.00000 | 30240.000000 | 30240.000000 | 30240.000000 | 30240.000000 | 30232.000000 | 30240.000000 |
mean | 15120.500000 | 0.70754 | 196.604266 | 685.769775 | 13.255721 | 1.138459 | 17422.938939 | 4.521296 |
std | 8729.680407 | 0.45490 | 132.346961 | 102.454307 | 9.890246 | 1.082913 | 17483.782840 | 2.286531 |
min | 1.000000 | 0.00000 | 80.000000 | 300.000000 | 1.000000 | 0.000000 | 5000.000000 | 1.000000 |
25% | 7560.750000 | 0.00000 | 111.000000 | 668.000000 | 5.000000 | 0.000000 | 9668.500000 | 3.000000 |
50% | 15120.500000 | 1.00000 | 141.000000 | 705.000000 | 10.000000 | 1.000000 | 12280.000000 | 5.000000 |
75% | 22680.250000 | 1.00000 | 238.000000 | 753.000000 | 20.000000 | 2.000000 | 14697.250000 | 7.000000 |
max | 30240.000000 | 1.00000 | 1005.000000 | 850.000000 | 40.000000 | 4.000000 | 99943.000000 | 8.000000 |
Two variables have missing values: Miles_driven_annually and Miles_driven_annually_bucket.
# Check if there are any NULL data that need to be dropped
ID                              0.000000
target                          0.000000
Gender                          0.000000
EngineHP                        0.000000
credit_history                  0.000000
Years_Experience                0.000000
annual_claims                   0.000000
Marital_Status                  0.000000
Vehical_type                    0.000000
Miles_driven_annually           0.026455
size_of_family                  0.000000
Age_bucket                      0.000000
EngineHP_bucket                 0.000000
Years_Experience_bucket         0.000000
Miles_driven_annually_bucket    0.026455
credit_history_bucket           0.000000
State                           0.000000
dtype: float64
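The figures above look consistent with the share of missing values per column expressed as a percentage (8 of 30,240 rows is about 0.026%). The exact call used in the notebook is not shown, so the sketch below is an assumption; it computes both the count and the percentage of nulls on a small made-up dataframe.

```python
import numpy as np
import pandas as pd

# Toy dataframe (hypothetical values) with one missing mileage
demo = pd.DataFrame({
    "Vehical_type": ["Car", "Truck", "Van", "Truck"],
    "Miles_driven_annually": [14749.0, np.nan, 9956.0, 12000.0],
})

missing_count = demo.isnull().sum()        # number of NaN cells per column
missing_pct = demo.isnull().mean() * 100   # share of NaN cells per column, in percent
print(missing_count)
print(missing_pct)
```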
#safe_driver = safe_driver.dropna()
ID | target | Gender | EngineHP | credit_history | Years_Experience | annual_claims | Marital_Status | Vehical_type | Miles_driven_annually | size_of_family | Age_bucket | EngineHP_bucket | Years_Experience_bucket | Miles_driven_annually_bucket | credit_history_bucket | State | |
0 | 1 | 1 | F | 522 | 656 | 1 | 0 | Married | Car | 14749.0 | 5 | <18 | >350 | <3 | <15k | Fair | IL |
1 | 2 | 1 | F | 691 | 704 | 16 | 0 | Married | Car | 15389.0 | 6 | 28-34 | >350 | 15-30 | 15k-25k | Good | NJ |
2 | 3 | 1 | M | 133 | 691 | 15 | 0 | Married | Van | 9956.0 | 3 | >40 | 90-160 | 15-30 | <15k | Good | CT |
3 | 4 | 1 | M | 146 | 720 | 9 | 0 | Married | Van | 77323.0 | 3 | 18-27 | 90-160 | 9-14' | >25k | Good | CT |
4 | 5 | 1 | M | 128 | 771 | 33 | 1 | Married | Van | 14183.0 | 4 | >40 | 90-160 | >30 | <15k | Very Good | WY |
The classes are not evenly balanced: the proportions of 1 and 0 are roughly 70% and 30% respectively.
1    70.753968
0    29.246032
Name: target, dtype: float64
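Class percentages like the ones above can be obtained with value_counts. The snippet below is a small sketch on a made-up target column; the call used in the original notebook is not shown, so treat this as one plausible way to produce such output.

```python
import pandas as pd

# Hypothetical target column: seven safe drivers (1) and three flagged for review (0)
target = pd.Series([1, 1, 1, 0, 1, 0, 1, 1, 0, 1], name="target")

# normalize=True returns fractions; multiply by 100 for percentages
class_pct = target.value_counts(normalize=True) * 100
print(class_pct)
```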
cat_features = safe_driver.select_dtypes(include=['object'])
Index(['Gender', 'Marital_Status', 'Vehical_type', 'Age_bucket', 'EngineHP_bucket', 'Years_Experience_bucket', 'Miles_driven_annually_bucket', 'credit_history_bucket', 'State'], dtype='object')
Gender | Marital_Status | Vehical_type | Age_bucket | EngineHP_bucket | Years_Experience_bucket | Miles_driven_annually_bucket | credit_history_bucket | State | |
0 | F | Married | Car | <18 | >350 | <3 | <15k | Fair | IL |
1 | F | Married | Car | 28-34 | >350 | 15-30 | 15k-25k | Good | NJ |
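Among the categorical variables we retain Gender, Marital_Status, Vehical_type, Age_bucket and State, and drop ID together with the remaining bucket columns:
safe_driver.drop(['ID', 'EngineHP_bucket', 'Years_Experience_bucket',
                  'Miles_driven_annually_bucket',
                  'credit_history_bucket'], axis=1, inplace=True)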
# Check if the dataset has any NaN values as these values will make our algorithms throw an exception
target                   0
Gender                   0
EngineHP                 0
credit_history           0
Years_Experience         0
annual_claims            0
Marital_Status           0
Vehical_type             0
Miles_driven_annually    8
size_of_family           0
Age_bucket               0
State                    0
dtype: int64
The Miles_driven_annually feature has some null values. Let us find which rows contain NaN and impute them with the median.
target | Gender | EngineHP | credit_history | Years_Experience | annual_claims | Marital_Status | Vehical_type | Miles_driven_annually | size_of_family | Age_bucket | State | |
1235 | 1 | F | 124 | 793 | 27 | 0 | Married | Truck | NaN | 3 | >40 | NJ |
7365 | 0 | F | 465 | 696 | 5 | 0 | Married | Truck | NaN | 8 | 18-27 | SD |
11464 | 1 | F | 137 | 787 | 18 | 1 | Married | Truck | NaN | 1 | >40 | CT |
18158 | 0 | F | 108 | 747 | 8 | 1 | Married | Truck | NaN | 1 | 18-27 | OR |
19795 | 1 | F | 121 | 774 | 19 | 0 | Married | Truck | NaN | 2 | 28-34 | NY |
25731 | 1 | F | 355 | 694 | 15 | 1 | Married | Truck | NaN | 5 | 28-34 | CT |
26512 | 1 | F | 109 | 743 | 40 | 0 | Married | Truck | NaN | 1 | >40 | OR |
27045 | 1 | F | 83 | 784 | 21 | 0 | Married | Truck | NaN | 1 | >40 | CT |
All the NaN values belong to the Truck vehicle type only. Let us look at the median of Miles_driven_annually for each vehicle type.
safe_driver.head(2)
target | Gender | EngineHP | credit_history | Years_Experience | annual_claims | Marital_Status | Vehical_type | Miles_driven_annually | size_of_family | Age_bucket | State | |
0 | 1 | F | 522 | 656 | 1 | 0 | Married | Car | 14749.0 | 5 | <18 | IL |
1 | 1 | F | 691 | 704 | 16 | 0 | Married | Car | 15389.0 | 6 | 28-34 | NJ |
# Median of Miles_driven_annually for each vehicle type
m = safe_driver.groupby("Vehical_type")["Miles_driven_annually"].median()
truck_median = m.loc["Truck"]
# Replace NaN values in Miles_driven_annually with the median value for Truck.
# There may be better ways to impute missing data, but we have just 8 NaN cells out of some 30,000+ rows,
# which is less than 0.03%, so imputing all 8 cells with the median is not going to skew our results.
missing_truck = safe_driver["Miles_driven_annually"].isnull() & (safe_driver["Vehical_type"] == "Truck")
safe_driver.loc[missing_truck, "Miles_driven_annually"] = truck_median
safe_driver.loc[safe_driver["Miles_driven_annually"] == 12370.5,]
target | Gender | EngineHP | credit_history | Years_Experience | annual_claims | Marital_Status | Vehical_type | Miles_driven_annually | size_of_family | Age_bucket | State | |
1235 | 1 | F | 124 | 793 | 27 | 0 | Married | Truck | 12370.5 | 3 | >40 | NJ |
7365 | 0 | F | 465 | 696 | 5 | 0 | Married | Truck | 12370.5 | 8 | 18-27 | SD |
11464 | 1 | F | 137 | 787 | 18 | 1 | Married | Truck | 12370.5 | 1 | >40 | CT |
18158 | 0 | F | 108 | 747 | 8 | 1 | Married | Truck | 12370.5 | 1 | 18-27 | OR |
19795 | 1 | F | 121 | 774 | 19 | 0 | Married | Truck | 12370.5 | 2 | 28-34 | NY |
25731 | 1 | F | 355 | 694 | 15 | 1 | Married | Truck | 12370.5 | 5 | 28-34 | CT |
26512 | 1 | F | 109 | 743 | 40 | 0 | Married | Truck | 12370.5 | 1 | >40 | OR |
27045 | 1 | F | 83 | 784 | 21 | 0 | Married | Truck | 12370.5 | 1 | >40 | CT |
target | Gender | EngineHP | credit_history | Years_Experience | annual_claims | Marital_Status | Vehical_type | Miles_driven_annually | size_of_family | Age_bucket | State |
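When missing values span more than one vehicle type, the same idea generalizes to a group-wise median. The following is a minimal sketch on made-up rows, assuming the same column names as the dataset; it is an alternative illustration, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values); the third Truck row has no mileage
demo = pd.DataFrame({
    "Vehical_type": ["Truck", "Truck", "Truck", "Car", "Car", "Truck"],
    "Miles_driven_annually": [10000.0, 14000.0, np.nan, 9000.0, 11000.0, 12000.0],
})

# transform("median") returns, for every row, the median of that row's vehicle type
group_median = demo.groupby("Vehical_type")["Miles_driven_annually"].transform("median")
demo["Miles_driven_annually"] = demo["Miles_driven_annually"].fillna(group_median)
print(demo)
```

Because the median ignores NaN by default, the missing Truck row receives the median of the three observed Truck values.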
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30240 entries, 0 to 30239
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   target                 30240 non-null  int64
 1   Gender                 30240 non-null  object
 2   EngineHP               30240 non-null  int64
 3   credit_history         30240 non-null  int64
 4   Years_Experience       30240 non-null  int64
 5   annual_claims          30240 non-null  int64
 6   Marital_Status         30240 non-null  object
 7   Vehical_type           30240 non-null  object
 8   Miles_driven_annually  30240 non-null  float64
 9   size_of_family         30240 non-null  int64
 10  Age_bucket             30240 non-null  object
 11  State                  30240 non-null  object
dtypes: float64(1), int64(6), object(5)
memory usage: 2.8+ MB
Looking at the feature values above, the range of values varies a lot from feature to feature. For example, 'Miles_driven_annually' is in the tens of thousands, whereas 'credit_history' is in the hundreds and 'annual_claims' is in single digits. Because of these varying magnitudes, we will scale the features to Z-scores using sklearn.preprocessing.scale.
# To standardize the numeric features we need to isolate them first into a separate dataframe
safe_driver_num_features = safe_driver.drop(safe_driver.select_dtypes(['object']), axis=1)
# Do not standardize 'target' which is our label
safe_driver_num_features.drop(['target'], axis=1, inplace=True)
safe_driver_cat_features = safe_driver.select_dtypes(['object'])
EngineHP | credit_history | Years_Experience | annual_claims | Miles_driven_annually | size_of_family |
from sklearn import preprocessing
safe_driver_scaled = pd.DataFrame(preprocessing.scale(safe_driver_num_features),
                                  columns=safe_driver_num_features.columns)
# We will concatenate the scaled dataframe with the categorical feature set
safe_driver = pd.concat([safe_driver_scaled, safe_driver['target'], safe_driver_cat_features], axis=1)
EngineHP | credit_history | Years_Experience | annual_claims | Miles_driven_annually | size_of_family | target | Gender | Marital_Status | Vehical_type | Age_bucket | State | |
0 | 2.458697 | -0.290571 | -1.239193 | -1.051311 | -0.152883 | 0.209362 | 1 | F | Married | Car | <18 | IL |
1 | 3.735665 | 0.177938 | 0.277478 | -1.051311 | -0.116272 | 0.646712 | 1 | F | Married | Car | 28-34 | NJ |
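To make the Z-score scaling concrete, the sketch below computes it by hand on made-up numbers: each column has its mean subtracted and is divided by its population standard deviation (ddof=0), which is the convention sklearn.preprocessing.scale follows. The toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy numeric features (hypothetical values) on very different scales
demo = pd.DataFrame({
    "Miles_driven_annually": [9000.0, 12000.0, 15000.0, 24000.0],
    "annual_claims": [0.0, 1.0, 2.0, 1.0],
})

# z = (x - mean) / std, using the population standard deviation (ddof=0)
z = (demo - demo.mean()) / demo.std(ddof=0)
print(z)
```

After scaling, every column has mean 0 and standard deviation 1, so no feature dominates purely because of its units.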
char = safe_driver.select_dtypes(exclude='number')
num = safe_driver.select_dtypes(include='number')
Gender | Marital_Status | Vehical_type | Age_bucket | State | |
0 | F | Married | Car | <18 | IL |
1 | F | Married | Car | 28-34 | NJ |
2 | M | Married | Van | >40 | CT |
3 | M | Married | Van | 18-27 | CT |
4 | M | Married | Van | >40 | WY |
# this option will display all rows
pd.set_option('display.max_rows', None)
# We are extracting the counts of all the unique values of the categorical variables
char.apply(lambda x: x.value_counts()).T.stack()
Gender F 13881.0 M 16359.0 Marital_Status Married 19820.0 Single 10420.0 Vehical_type Car 11582.0 Truck 8798.0 Utility 4007.0 Van 5853.0 Age_bucket 18-27 8097.0 28-34 2056.0 35-40 6546.0 <18 911.0 >40 12630.0 State AK 205.0 AL 246.0 AR 255.0 AZ 225.0 CA 251.0 CO 272.0 CT 4444.0 DE 261.0 FL 251.0 GA 242.0 HI 225.0 IA 242.0 ID 251.0 IL 220.0 IN 241.0 KS 241.0 KY 248.0 LA 264.0 MA 284.0 MD 247.0 ME 248.0 MI 235.0 MN 242.0 MO 237.0 MS 220.0 MT 238.0 NC 221.0 ND 245.0 NE 222.0 NH 229.0 NJ 4884.0 NM 236.0 NV 239.0 NY 3686.0 OH 223.0 OK 260.0 OR 3838.0 PA 257.0 RI 242.0 SC 249.0 SD 229.0 TN 242.0 TX 233.0 UT 244.0 VA 252.0 VT 1429.0 WA 233.0 WI 271.0 WV 1253.0 WY 288.0 dtype: float64
# dropping the state column
char = char.drop("State",axis=1)
char = pd.get_dummies(data=char,drop_first=True)
safe_driver_num_features = pd.concat(
[safe_driver_num_features, safe_driver['target']], axis=1)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30240 entries, 0 to 30239
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   EngineHP               30240 non-null  int64
 1   credit_history         30240 non-null  int64
 2   Years_Experience       30240 non-null  int64
 3   annual_claims          30240 non-null  int64
 4   Miles_driven_annually  30240 non-null  float64
 5   size_of_family         30240 non-null  int64
 6   target                 30240 non-null  int64
dtypes: float64(1), int64(6)
memory usage: 1.6 MB
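For readers unfamiliar with pd.get_dummies and drop_first=True as used on the categorical frame above, here is a small sketch on made-up rows. It shows that one indicator column is created per category, minus the first category of each variable, which acts as the baseline.

```python
import pandas as pd

# Toy categorical frame (hypothetical rows) using two of the dataset's column names
demo = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Vehical_type": ["Car", "Van", "Truck"],
})

# drop_first=True omits the alphabetically first level of each variable ("F" and "Car")
dummies = pd.get_dummies(demo, drop_first=True)
print(list(dummies.columns))
```

A row with all indicators equal to zero therefore represents the baseline levels, here a female driver with a car.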
Below, we separate our feature set from the label target and convert all the categorical variables to numeric. Then we split the feature set into training and test data sets.
Let us convert some of the categorical features into numeric values, giving a weightage to each category. We avoid encoders such as OneHotEncoder because these create sparse matrices and increase dimensionality. By assigning a 1 or a 2 to, say, Marital_Status, we give a higher weightage to Married by assigning it a value of 2.
safe_driver.head(3)
EngineHP | credit_history | Years_Experience | annual_claims | Miles_driven_annually | size_of_family | target | Gender | Marital_Status | Vehical_type | Age_bucket | State | |
0 | 2.458697 | -0.290571 | -1.239193 | -1.051311 | -0.152883 | 0.209362 | 1 | F | Married | Car | <18 | IL |
1 | 3.735665 | 0.177938 | 0.277478 | -1.051311 | -0.116272 | 0.646712 | 1 | F | Married | Car | 28-34 | NJ |
2 | -0.480595 | 0.051050 | 0.176366 | -1.051311 | -0.427060 | -0.665340 | 1 | M | Married | Van | >40 | CT |
# Convert Gender to a 1 or a 2
safe_driver['Gender'] = np.where(safe_driver['Gender'] == 'F', 1, 2)
# Convert Marital_Status to a 1 or a 2
safe_driver['Marital_Status'] = np.where(
safe_driver['Marital_Status'] == 'Single', 1, 2)
# Convert Vehical_type using LabelEncoder
le = preprocessing.LabelEncoder()
le.fit(safe_driver['Vehical_type'])
safe_driver['Vehical_type'] = le.transform(safe_driver['Vehical_type'])
# Convert Age_bucket using LabelEncoder
le.fit(safe_driver['Age_bucket'])
safe_driver['Age_bucket'] = le.transform(safe_driver['Age_bucket'])
EngineHP | credit_history | Years_Experience | annual_claims | Miles_driven_annually | size_of_family | target | Gender | Marital_Status | Vehical_type | Age_bucket | State | |
0 | 2.458697 | -0.290571 | -1.239193 | -1.051311 | -0.152883 | 0.209362 | 1 | 1 | 2 | 0 | 3 | IL |
1 | 3.735665 | 0.177938 | 0.277478 | -1.051311 | -0.116272 | 0.646712 | 1 | 1 | 2 | 0 | 1 | NJ |
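LabelEncoder assigns codes in sorted order of the labels, which is why "<18" is encoded as 3 in the table above rather than as the lowest value. If the natural age order matters to the model, an explicit ordering is one alternative. The sketch below is an assumption about how that could be done with an ordered pandas Categorical; it is not what the notebook does.

```python
import pandas as pd

# Explicit youngest-to-oldest order for the Age_bucket labels seen in the data
age_order = ["<18", "18-27", "28-34", "35-40", ">40"]

# A few sample labels (hypothetical rows)
ages = pd.Series(["<18", "28-34", ">40", "18-27", ">40"])

# codes follow the position of each label in age_order
codes = pd.Categorical(ages, categories=age_order, ordered=True).codes
print(list(codes))
```

With this mapping a larger code always means an older bucket, which tree-based models can exploit with a single split.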
# Keep a backup copy of the encoded dataframe and continue working on a fresh copy
panu = safe_driver.copy()
safe_driver = panu.copy()
# Drop the 'target' column from training dataframe as that is our label
X = safe_driver.drop(['target', 'State'], axis=1)
# The 'target' column is our label or outcome that we want to predict
y = safe_driver['target']
The target is split roughly 70% success (safe driver or target == 1) and 30% failure (bad driver or target == 0). Let us do class balancing using SMOTE and see the distribution.
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
# Named smote rather than os so that the os module imported earlier is not shadowed
smote = SMOTE(random_state=0)
columns = X.columns
os_data_X, os_data_y = smote.fit_resample(X, y)
#os_data_X = pd.DataFrame(data=os_data_X, columns=columns)
#os_data_y = pd.DataFrame(data=os_data_y, columns=['y'])
X_train, X_test, y_train, y_test = train_test_split(os_data_X, os_data_y, test_size=0.3, random_state=0)
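For intuition about what SMOTE does, the sketch below shows the core interpolation idea in plain NumPy: a synthetic minority row is placed at a random point on the line between a minority row and one of its nearest minority neighbours. The function name smote_like and the toy points are hypothetical, and this is a simplified illustration rather than imblearn's implementation.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating each chosen minority row
    towards a random one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances among minority rows
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Four minority points on the corners of the unit square (toy data)
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(X_minority, n_new=6, k=2)
print(synthetic)
```

Because synthetic rows are built from existing ones, a common practice is to split the data first and oversample only the training portion, so that test rows do not influence the synthetic samples; scores measured on a test set that contains synthetic rows may be optimistic.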
def dt():
    from sklearn.tree import DecisionTreeClassifier
    classifier = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    # validation of the model
    from sklearn.metrics import classification_report, confusion_matrix
    #print(classifier.score(X_test, y_test))
    r = classifier.score(X_test, y_test)
    print(classification_report(y_test, y_pred))

dt()
              precision    recall  f1-score   support

           0       0.66      0.68      0.67      6418
           1       0.67      0.65      0.66      6420

    accuracy                           0.67     12838
   macro avg       0.67      0.67      0.67     12838
weighted avg       0.67      0.67      0.67     12838
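The precision, recall and f1-score columns in a report like the one above are derived from counts of true and false positives and negatives. A minimal sketch with made-up labels (not the post's data) shows the arithmetic for the positive class:

```python
import numpy as np

# Hypothetical true labels and predictions for ten drivers
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])

tp = int(((y_true == 1) & (y_pred == 1)).sum())  # safe drivers correctly identified
fp = int(((y_true == 0) & (y_pred == 1)).sum())  # review cases wrongly marked safe
fn = int(((y_true == 1) & (y_pred == 0)).sum())  # safe drivers wrongly flagged
tn = int(((y_true == 0) & (y_pred == 0)).sum())  # review cases correctly identified

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)
print(precision, recall, f1, accuracy)
```

For an insurer, a false positive (a risky driver labelled safe) and a false negative (a safe driver sent for review) carry different costs, so precision and recall are worth reading separately rather than relying on accuracy alone.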
def rf():
    from sklearn.ensemble import RandomForestClassifier
    clf = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # validation of the model
    from sklearn.metrics import classification_report, confusion_matrix
    r = clf.score(X_test, y_test)
    print(classification_report(y_test, y_pred))

rf()
              precision    recall  f1-score   support

           0       0.79      0.70      0.74      6418
           1       0.73      0.81      0.77      6420

    accuracy                           0.76     12838
   macro avg       0.76      0.76      0.76     12838
weighted avg       0.76      0.76      0.76     12838
def sgd():
    from sklearn import linear_model
    clf = linear_model.SGDClassifier(max_iter=200, tol=1e-3).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # validation of the model
    from sklearn.metrics import classification_report, confusion_matrix
    r = clf.score(X_test, y_test)
    print(classification_report(y_test, y_pred))

sgd()
              precision    recall  f1-score   support

           0       0.27      0.00      0.00      6418
           1       0.50      1.00      0.67      6420

    accuracy                           0.50     12838
   macro avg       0.39      0.50      0.33     12838
weighted avg       0.39      0.50      0.33     12838
def ridge():
    from sklearn.linear_model import RidgeClassifier
    clf = RidgeClassifier().fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # validation of the model
    from sklearn.metrics import classification_report, confusion_matrix
    r = clf.score(X_test, y_test)
    print(classification_report(y_test, y_pred))

ridge()
              precision    recall  f1-score   support

           0       0.50      0.49      0.50      6418
           1       0.50      0.52      0.51      6420

    accuracy                           0.50     12838
   macro avg       0.50      0.50      0.50     12838
weighted avg       0.50      0.50      0.50     12838
def gb():
    from sklearn.ensemble import GradientBoostingClassifier
    clf = GradientBoostingClassifier().fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # validation of the model
    from sklearn.metrics import classification_report, confusion_matrix
    r = clf.score(X_test, y_test)
    print(classification_report(y_test, y_pred))

gb()
              precision    recall  f1-score   support

           0       0.84      0.49      0.62      6418
           1       0.64      0.91      0.75      6420

    accuracy                           0.70     12838
   macro avg       0.74      0.70      0.68     12838
weighted avg       0.74      0.70      0.68     12838
def xgb():
    from xgboost import XGBClassifier
    clf = XGBClassifier().fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # validation of the model
    from sklearn.metrics import classification_report, confusion_matrix
    r = clf.score(X_test, y_test)
    print(classification_report(y_test, y_pred))

xgb()
[18:18:05] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/ Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
              precision    recall  f1-score   support

           0       0.90      0.58      0.70      6418
           1       0.69      0.94      0.79      6420

    accuracy                           0.76     12838
   macro avg       0.80      0.76      0.75     12838
weighted avg       0.80      0.76      0.75     12838
models = {
    'Algorithm': ['Decision Tree', 'Random Forest', 'Ridge', 'Gradient Boosting', 'Extreme Gradient Boosting'],
    'Accuracy': [66.6, 75.6, 50.3, 69.8, 75.8],
}
score = pd.DataFrame(models)
score = score.sort_values("Accuracy", ascending=False)
Algorithm | Accuracy | |
4 | Extreme Gradient Boosting | 75.8 |
1 | Random Forest | 75.6 |
3 | Gradient Boosting | 69.8 |
0 | Decision Tree | 66.6 |
2 | Ridge | 50.3 |
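Rather than typing the accuracies into a dictionary by hand, the scores could be collected in a loop. The sketch below is an assumption about how that might look: it uses synthetic data from make_classification instead of the safe_driver data, only two of the models, and hypothetical names such as candidates and demo_score.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    # score() returns accuracy as a fraction; convert to a percentage
    rows.append({"Algorithm": name, "Accuracy": round(100 * model.score(X_te, y_te), 1)})

demo_score = pd.DataFrame(rows).sort_values("Accuracy", ascending=False)
print(demo_score)
```

Collecting the numbers this way keeps the comparison table in sync with the models whenever the data or hyperparameters change.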
import seaborn as sns
import matplotlib.pyplot as plt
sns.barplot(x = 'Algorithm', y = 'Accuracy', data = score)
<AxesSubplot:xlabel='Algorithm', ylabel='Accuracy'>
Readers are welcome to mail us their opinions on how to further improve the model; our contact details can be found here.
If you want to read more such case studies, see Whom should you ask for donations for a charity or Identify if a patient has cancer.
Regression problems can be found at House Price Prediction and Insurance Premium Prediction.