Titanic Disaster Study
Table of Contents
1. Introduction
2. Data Import
2.1 Import Libraries
2.2 Load specific tools
2.3 Data Import
2.4 Data Inspection
3. Data Exploration and Data Cleaning
3.1 Pivoting Features
3.2 Age Feature
3.3 Embarked Feature
3.4 Fare Feature
3.5 Cabin Feature
4. Data Visualization and Feature Exploration
4.1 Gender
4.2 Age
4.3 Pclass and Fare
4.4 SibSp & Parch
4.5 Embarked
5. Statistical Study
5.1 Main Features
5.2 Correlation Study
6. Feature Preparation
6.1 Passenger Title
6.2 Family Size
6.3 Tickets
6.4 Family Survival Rate
6.5 Fare Binning
6.6 Age Binning
6.7 Encoding
7. Model Preparation
8. Models
9. Best Models
10. Create Submission
1. Introduction
On April 15, 1912, the Titanic sank after colliding with an iceberg; 1502 of the 2224 passengers and crew members died. A dataset containing passenger information has been made available. The purpose of this Notebook is to perform a comparison study of different models aimed at predicting survival. The data is obtained from Kaggle.
2. Data Import
2.1 Import Libraries
# Load libraries
import sys
print("Python version:\t\t{}".format(sys.version))
import pandas as pd
print("pandas version:\t\t{}".format(pd.__version__))
import matplotlib
print("matplotlib version:\t{}".format(matplotlib.__version__))
import numpy as np
print("numpy version:\t\t{}".format(np.__version__))
import scipy as sp
print("scipy version:\t\t{}".format(sp.__version__))
import sklearn
print("sklearn version:\t{}".format(sklearn.__version__))
Python version: 3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)]
pandas version: 0.25.3
matplotlib version: 3.1.1
numpy version: 1.18.1
scipy version: 1.3.2
sklearn version: 0.22.1
2.2 Load specific tools
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid', context='notebook', palette='deep')
%matplotlib inline
# Models
from sklearn.model_selection import cross_val_score,GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, ExtraTreeClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC, NuSVC, LinearSVC
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier, RidgeClassifier, SGDClassifier, Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier, VotingClassifier, ExtraTreesClassifier, RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.decomposition import PCA
from xgboost import XGBClassifier
# Tools
from sklearn.metrics import mean_squared_error, roc_auc_score, roc_curve, auc
from sklearn.metrics import accuracy_score,classification_report, precision_recall_curve, confusion_matrix
from sklearn.model_selection import cross_val_score, GridSearchCV, learning_curve
from collections import Counter
from tqdm import tqdm
import string
# Do not show warnings (added after Notebook was finalized)
import warnings
warnings.filterwarnings('ignore')
2.3 Data Import
We import both the training and test sets and then combine them so that statistics can be computed on the entire population of passengers.
# Import training and testing csv datasets
train = pd.read_csv('./Data/train.csv')
test = pd.read_csv('./Data/test.csv')
# Inspect data
train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# Inspect data
test.head()
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
# tool functions: concatenate train/test and split them back
def concat_df(train, test):
return pd.concat([train, test], sort=True).reset_index(drop=True)
def divide_df(full_df):
return full_df.loc[:890], full_df.loc[891:].drop(['Survived'], axis=1)
df_all = concat_df(train, test)
# create a clone to test different transformations
df_exp = df_all.copy()
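As a quick sanity check (using the frames created above), splitting the combined frame back should recover the original row counts; a minimal sketch:
# sanity check: divide_df should recover the original train/test row counts
train_check, test_check = divide_df(df_all)
assert train_check.shape[0] == train.shape[0] == 891
assert test_check.shape[0] == test.shape[0] == 418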
2.4 Data Inspection
The data is divided into two separate datasets:
- a training set containing a set of features and our target variable (whether or not a passenger survived)
- a test set containing only the set of features
Data Features
- Pclass: Categorical feature describing the passenger class (1=Upper, 2=Middle, 3=Lower).
- Name: String containing the passenger's name and title.
- Sex: Categorical variable describing the passenger's gender.
- Age: Numerical feature standing for the passenger's age.
- SibSp: Number of siblings/spouses aboard.
- Parch: Number of parents/children aboard.
- Ticket: Ticket number.
- Fare: Price of the ticket.
- Cabin: Cabin id.
- Embarked: Categorical feature, port of embarkation.
Target:
- Survived: Target feature (1=Survived, 0=Died)
3. Data Exploration and Data Cleaning
# Remove the PassengerId from the sets as it is not needed in the models
passengerId = test.PassengerId
train = train.drop(['PassengerId'],axis=1)
test = test.drop(['PassengerId'],axis=1)
# list information for each feature (type, number of non-null records)
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 76.7+ KB
# list information for each feature (type, number of non-null records)
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 10 columns):
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(3), object(5)
memory usage: 32.8+ KB
Comment:
Several of the features in the training set appear to be incomplete (Age, Cabin, and Embarked).
Several of the features in the test set appear to be incomplete (Age, Cabin, and Fare).
# compute percentage of missing values
# compute number of missing records
missing_total = train.isnull().sum().sort_values(ascending=False)
# convert to percentages
missing_percentage = missing_total/train.shape[0]*100
# display missing record %
print('Missing values in training set:')
pd.concat([missing_total,missing_percentage],keys=['Count','Percentage'],axis=1)
Missing values in training set:
Count | Percentage | |
---|---|---|
Cabin | 687 | 77.104377 |
Age | 177 | 19.865320 |
Embarked | 2 | 0.224467 |
Fare | 0 | 0.000000 |
Ticket | 0 | 0.000000 |
Parch | 0 | 0.000000 |
SibSp | 0 | 0.000000 |
Sex | 0 | 0.000000 |
Name | 0 | 0.000000 |
Pclass | 0 | 0.000000 |
Survived | 0 | 0.000000 |
Comment:
Based on the above table, the following observations can be made:
- The Cabin feature is mostly empty; it will be hard to use as-is.
- The Age feature contains a large number of missing values. This will require a smarter approach than simply filling the nulls with a median.
- The Embarked feature only has 2 missing values. We can estimate these by taking a quick look at the data and using the most probable value as a replacement.
# compute percentage of missing values
# compute number of missing records
missing_total = test.isnull().sum().sort_values(ascending=False)
# convert to percentages
missing_percentage = missing_total/test.shape[0]*100
# display missing record %
print('Missing values in test set:')
pd.concat([missing_total,missing_percentage],keys=['Count','Percentage'],axis=1)
Missing values in test set:
Count | Percentage | |
---|---|---|
Cabin | 327 | 78.229665 |
Age | 86 | 20.574163 |
Fare | 1 | 0.239234 |
Embarked | 0 | 0.000000 |
Ticket | 0 | 0.000000 |
Parch | 0 | 0.000000 |
SibSp | 0 | 0.000000 |
Sex | 0 | 0.000000 |
Name | 0 | 0.000000 |
Pclass | 0 | 0.000000 |
Comment:
Based on the above table, the following observations can be made:
- The Cabin feature is also mostly empty; it will be hard to use as-is.
- The Age feature contains a large number of missing values. This will require a smarter approach than simply filling the nulls with a median.
- The Fare feature only has 1 missing value. We can estimate it by taking a quick look at the data and using the most probable value as a replacement.
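The missing-value report above is computed twice with near-identical code; a small helper (a sketch assuming pandas imported as above) can factor it out:
# helper: missing-value count and percentage for any dataframe
def missing_report(df):
    total = df.isnull().sum().sort_values(ascending=False)
    percentage = total / df.shape[0] * 100
    return pd.concat([total, percentage], keys=['Count', 'Percentage'], axis=1)
Calling missing_report(train) and missing_report(test) then reproduces the two tables above, with each percentage computed against the size of the corresponding set.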
3.1 Pivoting Features
Before diving into the data, it is useful to get a quick overview of what awaits us. To do so, we can pivot several of the features against the target and quickly identify which features seem important.
train[['Sex','Survived']].groupby(['Sex'],as_index=False).agg(['mean','count'])
Survived | ||
---|---|---|
mean | count | |
Sex | ||
female | 0.742038 | 314 |
male | 0.188908 | 577 |
train[['Pclass','Survived']].groupby(['Pclass'],as_index=False).agg(['mean','count'])
Survived | ||
---|---|---|
mean | count | |
Pclass | ||
1 | 0.629630 | 216 |
2 | 0.472826 | 184 |
3 | 0.242363 | 491 |
train[['SibSp','Survived']].groupby(['SibSp'],as_index=False).agg(['mean','count'])
Survived | ||
---|---|---|
mean | count | |
SibSp | ||
0 | 0.345395 | 608 |
1 | 0.535885 | 209 |
2 | 0.464286 | 28 |
3 | 0.250000 | 16 |
4 | 0.166667 | 18 |
5 | 0.000000 | 5 |
8 | 0.000000 | 7 |
train[['Parch','Survived']].groupby(['Parch'],as_index=False).agg(['mean','count'])
Survived | ||
---|---|---|
mean | count | |
Parch | ||
0 | 0.343658 | 678 |
1 | 0.550847 | 118 |
2 | 0.500000 | 80 |
3 | 0.600000 | 5 |
4 | 0.000000 | 4 |
5 | 0.200000 | 5 |
6 | 0.000000 | 1 |
In conclusion:
- Gender: females have a much higher chance of survival than males (74% vs. 19%)
- Pclass: the survival rate is strongly correlated with the passenger class
- SibSp, Parch: smaller families tend to have a higher survival rate
3.2 Age Feature
The Age feature is likely important for the predictions; however, it contains a large number of missing values. We need a strategy to fill them. Let's start by computing the feature correlations to determine which features are most correlated with the age of the passengers.
# compute all correlations
df_all_corr = df_all.corr().abs()
# display correlations with age
print("Correlation with Age feature")
df_all_corr['Age'].sort_values(ascending=False)[1:]
Correlation with Age feature
Pclass 0.408106
SibSp 0.243699
Fare 0.178740
Parch 0.150917
Survived 0.077221
PassengerId 0.028814
Name: Age, dtype: float64
As listed above, the Pclass feature is fairly well correlated with the Age of the passenger. However, the listed correlations do not include the effects of categorical features such as the gender. We can break the correlations down into sub-groups.
df_exp['Cat'] = df_exp['Sex'].astype(str) + df_exp['Pclass'].astype(str)
df_exp.groupby(['Cat']).median()['Age']
Cat
female1 36.0
female2 28.0
female3 22.0
male1 42.0
male2 29.5
male3 25.0
Name: Age, dtype: float64
From the above, we can see that within each class women tend to be younger than men. In addition, older passengers tend to travel in higher classes.
# fill missing age with groups Sex + PClass
df_all['Age'] = df_all.groupby(['Sex', 'Pclass'
])['Age'].apply(lambda x: x.fillna(x.median()))
assert df_all['Age'].isnull().sum()==0
The age feature is now complete.
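For reference, the same Sex/Pclass group-median imputation can be written with transform, which broadcasts the group medians back to each row; an equivalent alternative to the apply call above (a sketch, not meant to be run in addition to it):
# alternative imputation: broadcast group medians with transform
group_median_age = df_all.groupby(['Sex', 'Pclass'])['Age'].transform('median')
df_all['Age'] = df_all['Age'].fillna(group_median_age)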
3.3 Embarked Feature
The embarked feature has missing values in the training set.
# Distribution of the data
train['Embarked'].value_counts(dropna=False)
S 644
C 168
Q 77
NaN 2
Name: Embarked, dtype: int64
We now inspect the two records with missing values:
train[train['Embarked'].isnull()]
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
61 | 1 | 1 | Icard, Miss. Amelie | female | 38.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN |
829 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN |
Let's look at passengers with similar features:
- Female
- 1st Class
- Fare between 50 and 100
df_all[(df_all['Pclass'] == 1) & (df_all['Sex'] == 'female') &
(df_all['Fare'] >= 50) & (df_all['Fare'] <= 100)].groupby(
['Embarked']).count()['Age']
Embarked
C 30
Q 2
S 35
Name: Age, dtype: int64
From the above, the most common embarked location is “S”. We will replace the missing values with this.
# Fill missing values
df_all['Embarked'].fillna('S',inplace=True)
3.4 Fare Feature
The test set has a missing value for the Fare feature.
# Display record corresponding to missing value
df_all[df_all.Fare.isnull()]
Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1043 | 60.5 | NaN | S | NaN | Storey, Mr. Thomas | 0 | 1044 | 3 | male | 0 | NaN | 3701 |
We will use the median Fare of the subset corresponding to Pclass=3, SibSp=0, and Parch=0 (a third-class passenger travelling alone, like this record).
# Extract median
subset_med = df_all[(df_all['Pclass'] == 3) & (df_all['SibSp'] == 0) &
(df_all['Parch'] == 0)]['Fare'].median()
# Replace missing value
df_all['Fare'] = df_all['Fare'].fillna(subset_med)
3.5 Cabin Feature
The cabin feature is missing for 77% of the training set and 78% of the test set. With such a high percentage, the feature can either be dropped or feature engineering can be used to understand how the cabin id is defined. We opt for the second option.
# Feature inspection
df_all['Cabin'].sort_values().head(5)
583 A10
1099 A11
475 A14
556 A16
1222 A18
Name: Cabin, dtype: object
Some research on the ship shows that the Cabin value encodes the following information:
- One letter standing for the boat deck
- One number standing for the cabin number
source: Wikipedia
# Extract deck letter from the Cabin feature (use "M" for missing when Cabin is null)
df_all['Deck'] = df_all['Cabin'].apply(lambda x: x[0] if pd.notnull(x) else "M")
df_plot = df_all.groupby(['Deck', 'Pclass']).size().reset_index().pivot(columns='Pclass', index='Deck', values=0)
df_plot.fillna(0)
Pclass | 1 | 2 | 3 |
---|---|---|---|
Deck | |||
A | 22.0 | 0.0 | 0.0 |
B | 65.0 | 0.0 | 0.0 |
C | 94.0 | 0.0 | 0.0 |
D | 40.0 | 6.0 | 0.0 |
E | 34.0 | 4.0 | 3.0 |
F | 0.0 | 13.0 | 8.0 |
G | 0.0 | 0.0 | 5.0 |
M | 67.0 | 254.0 | 693.0 |
T | 1.0 | 0.0 | 0.0 |
df_plot = df_plot.div(df_plot.sum(axis=1), axis=0)
fig, ax = plt.subplots(figsize=(16,6))
#ax = sns.countplot(x='Deck', data=df_all, hue='Pclass')
df_plot.plot(kind='bar', stacked=True, ax = ax)
ax.set_title('Class Distribution per Deck', fontsize=15)
ax.set_ylabel("Fraction of deck")
ax.legend()
ax.set_ylim(0,1);
The following observations can be made from the above plot:
- Decks A, B, C are dedicated to the 1st class
- Decks D and E are mostly assigned to the 1st class
- Deck F is split between the 2nd and 3rd classes
- Deck G holds 3rd class passengers only
- Only one passenger is assigned to Deck T; this is likely a mistake.
# move T Deck passenger to Deck A
idx = df_all[df_all['Deck'] == 'T'].index
df_all.loc[idx, 'Deck'] = 'A'
df_plot, _ = divide_df(df_all)
df_plot = df_plot.groupby(['Deck', 'Survived']).size().reset_index().pivot(
columns='Survived', index='Deck').fillna(0).astype(int)
df_plot = df_plot.div(df_plot.sum(axis=1), axis=0)
fig, ax = plt.subplots(figsize=(16,6))
#ax = sns.countplot(x='Deck', data=df_all, hue='Pclass')
df_plot.plot(kind='bar', stacked=True, ax = ax, color=['r', 'b'])
ax.set_title('Survival Rate per Deck', fontsize=15)
ax.set_ylabel("Survival Rate")
ax.legend()
ax.set_ylim(0,1);
We can now group decks based on the survival rate and the class distribution.
We group:
- A, B, C since they exclusively contain 1st class passengers.
- D and E as they mostly contain 1st class passengers.
- F and G as they contain 2nd and 3rd class passengers while having a similar survival rate.
df_all['Deck'] = df_all['Deck'].replace(['A', 'B', 'C'], 'ABC')
df_all['Deck'] = df_all['Deck'].replace(['D', 'E'], 'DE')
df_all['Deck'] = df_all['Deck'].replace(['F', 'G'], 'FG')
# drop cabin
df_all = df_all.drop(['Cabin'], axis=1)
At this point, we have filled all the missing values.
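A quick check on the imputed features (Survived is still missing for the test rows by construction):
# verify that the imputed features no longer contain missing values
assert df_all[['Age', 'Embarked', 'Fare', 'Deck']].isnull().sum().sum() == 0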
train, test = divide_df(df_all)
4. Data Visualization and Feature Exploration
Before we implement the full model, it is important to inspect the data and answer a few basic questions. This will help us understand how the data is distributed and will also provide useful input for our models.
Based on the famous rule “Women and children first”, we expect the gender and age to be strongly correlated with the survival rate.
Questions:
- Gender: Is the survival rate higher for females?
- Age: Is the survival rate higher for young passengers?
- Pclass & Fare: Is the survival rate higher amongst wealthy passengers?
- SibSp & Parch: Is the survival rate for families higher than for single passengers?
4.1 Gender
# Plot the bar plot
pal = {'female':"salmon",'male':"skyblue"}
plt.subplots(figsize = (6,6))
ax = sns.barplot(x = "Sex", y = "Survived", data=train, palette = pal)
plt.title("Impact of gender on the survival rate",fontsize=15)
plt.ylabel("Survival rate",fontsize=15)
plt.xlabel("Sex",fontsize=15);
print("Male mean survival rate = \t {:.2f}%".format(train[train['Sex'] =='male']['Survived'].mean()*100))
print("Female mean survival rate = \t {:.2f}%".format(train[train['Sex'] != 'male']['Survived'].mean()*100))
Male mean survival rate = 18.89%
Female mean survival rate = 74.20%
Comment
Based on the bar plot, gender appears to be a critical feature when it comes to determining the fate of a passenger. Indeed, females have a much higher survival rate on average.
4.2 Age
# Plot kernel density plot
fig, ax = plt.subplots(figsize=(16,6))
ax = sns.kdeplot(train.loc[(train['Survived']==1),'Age'],shade=True,color='blue',label='Survived');
ax = sns.kdeplot(train.loc[(train['Survived']==0),'Age'],shade=True,color='red',label='Not Survived')
ax.set_ylabel('Frequency',fontsize=15)
ax.set_xlabel('Age',fontsize=15);
print("Toddler survival rate = \t{:.2f}%".format(train[train['Age'] <= 2]['Survived'].mean()*100,2))
print("Children survival rate = \t{:.2f}%".format(train[(train['Age'] > 2) & (train['Age'] <= 12)]['Survived'].mean()*100,2))
print("Teenager survival rate = \t{:.2f}%".format(train[(train['Age'] > 12) & (train['Age'] <= 18)]['Survived'].mean()*100,2))
print("Young Adult survival rate = \t{:.2f}%".format(train[(train['Age'] > 18) & (train['Age'] <= 34)]['Survived'].mean()*100,2))
print("Adult survival rate = \t\t{:.2f}%".format(train[(train['Age'] > 34) & (train['Age'] <= 50)]['Survived'].mean()*100,2))
print("pre-senior survival rate = \t{:.2f}%".format(train[(train['Age'] > 50) & (train['Age'] <= 70)]['Survived'].mean()*100,2))
print("senior survival rate = \t\t{:.2f}%".format(train[train['Age'] > 70]['Survived'].mean()*100,2))
Toddler survival rate = 62.50%
Children survival rate = 55.56%
Teenager survival rate = 42.86%
Young Adult survival rate = 33.74%
Adult survival rate = 42.57%
pre-senior survival rate = 35.59%
senior survival rate = 20.00%
Comment
Based on the distribution plot, the age also appears to be a critical feature. Young children have a much higher survival rate on average than the rest of the passengers. The survival rate tends to decrease with age.
4.3 Pclass and Fare
# Plot the survival rate per class
pal = {1:"gold",2:"silver",3:'sandybrown'}
plt.subplots(figsize = (6,6))
ax = sns.barplot(x = "Pclass", y = "Survived", data=train, palette = pal)
plt.title("Impact of class on the survival rate",fontsize=15)
plt.ylabel("Survival rate",fontsize=15)
plt.xlabel("Class",fontsize=15);
print("Upper class survival rate =\t{:.2f}%".format(train[train['Pclass'] == 1]['Survived'].mean()*100))
print("Middle class survival rate =\t{:.2f}%".format(train[train['Pclass'] == 2]['Survived'].mean()*100))
print("Lower class survival rate =\t{:.2f}%".format(train[train['Pclass'] == 3]['Survived'].mean()*100))
Upper class survival rate = 62.96%
Middle class survival rate = 47.28%
Lower class survival rate = 24.24%
Comment
The above plot confirms our assumption: upper class passengers had a much higher survival rate.
Before we look at the impact of the fare on the survival rate, we need to verify how the class is correlated to the fare.
# Plot the survival rate per class
pal = {1:"gold",2:"silver",3:'sandybrown'}
plt.subplots(figsize = (6,6))
ax = sns.barplot(x = "Pclass", y = "Fare", data=train, palette = pal)
plt.title("Impact of the passenger class on the fare",fontsize=15)
plt.ylabel("Fare",fontsize=15)
plt.xlabel("Pclass",fontsize=15);
print('Median fare:')
print("1st class =\t$${:.2f}".format(train[train['Pclass'] == 1]['Fare'].median()))
print("2nd class =\t$${:.2f}".format(train[train['Pclass'] == 2]['Fare'].median()))
print("3nd class =\t$${:.2f}".format(train[train['Pclass'] == 3]['Fare'].median()))
Median fare:
1st class = $60.29
2nd class = $14.25
3rd class = $8.05
# Box plot
fig, ax = plt.subplots(figsize=(16,6))
sns.boxplot(x = 'Pclass', y = 'Fare', hue = 'Survived', data = train, ax = ax)
ax.set_title('Pclass vs Fare Survival Comparison',fontsize=15)
ax.set_ylim(0, 300);
fig = plt.figure(figsize=(16,6),)
ax=sns.distplot(train['Fare'] , color='blue',kde=False,bins=20)
ax.set_ylabel("Count",fontsize=15)
ax.set_title('Fare distribution',fontsize=15)
ax.set_xticks(range(0,600,25));
# Kernel Density Plot
fig = plt.figure(figsize=(16,6))
ax=sns.distplot(train.loc[(train['Survived'] == 0),'Fare'] , color='red',label='Not Survived',kde=False,bins=21)
ax=sns.distplot(train.loc[(train['Survived'] == 1),'Fare'] , color='blue', label='Survived',kde=False,bins=41)
ax.legend(['Not Survived','Survived'])
plt.xlabel("Fare",fontsize=15)
plt.ylabel("Frequency of Passenger Survived",fontsize=15)
plt.title('Fare Distribution Survived vs Non Survived',fontsize=15);
Comment
As expected, the survival rate increases with the fare. Based on the above plot, it appears that the survival rate is greater than 50% for fares higher than $150.
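This reading of the plot can be checked directly on the training frame defined above:
# survival rate for passengers who paid more than $150
high_fare = train[train['Fare'] > 150]
print("Fare > $150 survival rate = {:.2f}% ({} passengers)".format(
    high_fare['Survived'].mean() * 100, len(high_fare)))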
4.4 SibSp & Parch
train[['SibSp','Survived']].groupby(['SibSp'],as_index=False).mean().sort_values(by='Survived',ascending =False)
SibSp | Survived | |
---|---|---|
1 | 1 | 0.535885 |
2 | 2 | 0.464286 |
0 | 0 | 0.345395 |
3 | 3 | 0.250000 |
4 | 4 | 0.166667 |
5 | 5 | 0.000000 |
6 | 8 | 0.000000 |
fig, ax = plt.subplots(figsize=(16,6))
sns.barplot(x="SibSp", y="Survived",color='r',data=train);
train[['Parch','Survived']].groupby(['Parch'],as_index=False).mean().sort_values(by='Survived',ascending =False)
Parch | Survived | |
---|---|---|
3 | 3 | 0.600000 |
1 | 1 | 0.550847 |
2 | 2 | 0.500000 |
0 | 0 | 0.343658 |
5 | 5 | 0.200000 |
4 | 4 | 0.000000 |
6 | 6 | 0.000000 |
fig, ax = plt.subplots(figsize=(16,6))
sns.barplot(x="Parch", y="Survived",color='b', data=train);
Comment
Passengers traveling with large families had a lower survival rate.
4.5 Embarked
train[['Embarked','Survived']].groupby(['Embarked'],as_index=False).mean().sort_values(by='Survived',ascending =False)
Embarked | Survived | |
---|---|---|
0 | C | 0.553571 |
1 | Q | 0.389610 |
2 | S | 0.339009 |
5. Statistical Study
In this section, we will inspect the data and quantify the observations that result from the data visualization.
5.1 Main Features
# Turn the Sex feature into a binary variable (female=0, male=1)
train['Sex'] = train['Sex'].apply(lambda x: 0 if x == "female" else 1)
test['Sex'] = test['Sex'].apply(lambda x: 0 if x == "female" else 1)
train.describe()
Age | Fare | Parch | PassengerId | Pclass | Sex | SibSp | Survived | |
---|---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 29.188182 | 32.204208 | 0.381594 | 446.000000 | 2.308642 | 0.647587 | 0.523008 | 0.383838 |
std | 13.337887 | 49.693429 | 0.806057 | 257.353842 | 0.836071 | 0.477990 | 1.102743 | 0.486592 |
min | 0.420000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 22.000000 | 7.910400 | 0.000000 | 223.500000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 26.000000 | 14.454200 | 0.000000 | 446.000000 | 3.000000 | 1.000000 | 0.000000 | 0.000000 |
75% | 36.000000 | 31.000000 | 0.000000 | 668.500000 | 3.000000 | 1.000000 | 1.000000 | 1.000000 |
max | 80.000000 | 512.329200 | 6.000000 | 891.000000 | 3.000000 | 1.000000 | 8.000000 | 1.000000 |
Comment
From the statistical data above, it appears that only 38% of the passengers survived.
train[['Sex', 'Survived']].groupby("Sex").mean()
Survived | |
---|---|
Sex | |
0 | 0.742038 |
1 | 0.188908 |
train[['Pclass', 'Survived']].groupby("Pclass").mean()
Survived | |
---|---|
Pclass | |
1 | 0.629630 |
2 | 0.472826 |
3 | 0.242363 |
5.2 Correlation Study
# Feature correlation
train.corr()['Survived'].sort_values()
Sex -0.543351
Pclass -0.338481
Age -0.058635
SibSp -0.035322
PassengerId -0.005007
Parch 0.081629
Fare 0.257307
Survived 1.000000
Name: Survived, dtype: float64
# Compute the correlation matrix
corr = train.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(10, 12))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
ax = sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1.0,vmin=-1.0, center=0,annot=True,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
ylabels = [x.get_text() for x in ax.get_yticklabels()]
plt.yticks(np.arange(len(ylabels))+0.5,ylabels, rotation=0, fontsize="10", va="center")
ax.set_ylim(8,0.5);
Comment
Strong positive correlations:
- Parch and SibSp (0.41)
- Fare and Survived (0.26)
- Parch and Fare (0.22)
Strong negative correlation
- Fare and Pclass (-0.42)
- Pclass and Age (-0.42)
- Pclass and Survived (-0.34)
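For reference, these pairs can also be extracted programmatically from the corr matrix computed above; a minimal sketch:
# list the strongest pairwise correlations (excluding self-correlations)
corr_pairs = corr.abs().unstack().sort_values(ascending=False)
corr_pairs = corr_pairs[corr_pairs < 1.0]
# each pair appears twice ((A, B) and (B, A)); drop_duplicates keeps one of each
print(corr_pairs.drop_duplicates().head(6))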
6. Feature Preparation
Based on the knowledge gathered, we can now create new features that will help improve the model accuracy.
6.1 Passenger Title
Upon inspection of the Name feature, it appears that a title is assigned to each passenger. We extract it and store it in the dataset.
# extract the title using a regular expression
df_all['Title'] = df_all['Name'].str.extract(r' ([A-Za-z]+)\.',expand=False)
df_all['Is_Married'] = 0
df_all.loc[df_all['Title'] == 'Mrs', 'Is_Married'] = 1
fig, ax = plt.subplots(figsize=(16,6))
sns.barplot(x=df_all['Title'].value_counts().index, y=df_all['Title'].value_counts().values, ax=ax)
ax.set_title("Title Distribution");
There are four principal titles; the rest consist mostly of single occurrences. Let's see whether the survival rate varies between titles.
df_plot, _ = divide_df(df_all)
df_plot = df_plot.groupby(['Title', 'Survived']).size().reset_index().pivot(columns='Survived', index='Title').fillna(0).astype(int)
df_plot = df_plot.div(df_plot.sum(axis=1), axis=0)
fig, ax = plt.subplots(figsize=(16,6))
#ax = sns.countplot(x='Deck', data=df_all, hue='Pclass')
df_plot.plot(kind='bar', stacked=True, ax = ax, color=['r', 'b'])
ax.set_title('Survival Rate per Title', fontsize=15)
ax.set_ylabel("Survival Rate")
ax.legend(['Not Survived', 'Survived'], loc="upper right")
ax.set_ylim(0,1);
Based on the results shown above, it appears that different titles are used to describe the same status. For instance, Miss, Mlle, and Ms all refer to the same status as Miss. We standardize the titles by grouping equivalent values.
df_all['Title'] = df_all['Title'].replace(
['Miss', 'Mrs', 'Ms', 'Mlle', 'Lady', 'Mme', 'Countess', 'Dona'],'Miss/Mrs/Ms')
df_all['Title'] = df_all['Title'].replace(
['Dr', 'Col', 'Major', 'Jonkheer', 'Capt', 'Sir', 'Don', 'Rev'], 'Rare')
df_plot, _ = divide_df(df_all)
df_plot = df_plot.groupby(['Title', 'Survived']).size().reset_index().pivot(
columns='Survived', index='Title').fillna(0).astype(int)
df_plot = df_plot.div(df_plot.sum(axis=1), axis=0)
fig, axes = plt.subplots(1, 2, figsize=(16,6))
df_plot.plot(kind='bar', stacked=True, ax = axes[0], color=['r', 'b'])
axes[0].set_title('Survival Rate per Title', fontsize=15)
axes[0].set_ylabel("Survival Rate")
axes[0].legend(['Not Survived', 'Survived'], loc="upper right")
axes[0].set_ylim(0,1)
sns.barplot(x=df_all['Title'].value_counts().index, y=df_all['Title'].value_counts().values, ax=axes[1])
axes[1].set_title("Title Distribution");
As displayed above, the passenger title influences the survival rate.
6.2 Family Size
We can create a new feature representing the size of the family:
df_all['Family_Size'] = df_all['Parch'] + df_all['SibSp'] + 1
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
sns.barplot(x=df_all['Family_Size'].value_counts().index,
y=df_all['Family_Size'].value_counts(),
ax=axes[0])
axes[0].set_title("Family Size Distribution", fontsize=15)
sns.countplot(x='Family_Size',
hue='Survived',
data=df_all,
ax=axes[1],
palette=sns.diverging_palette(10, 255, sep=80, n=2))
axes[1].legend(['Not Survived', 'Survived'], loc='upper right')
axes[1].set_title("Survival Rate based on Family Size", fontsize=15)
Text(0.5, 1.0, 'Survival Rate based on Family Size')
As shown above, single individuals tend to die more often than members of families of 2, 3, or 4. However, larger families suffer more casualties.
# group family sizes
family_map = {
1: 'Alone',
2: 'Small',
3: 'Small',
4: 'Small',
5: 'Medium',
6: 'Medium',
7: 'Large',
8: 'Large',
11: 'Large'
}
df_all['Family_Size_Grp'] = df_all['Family_Size'].map(family_map)
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
sns.barplot(x=df_all['Family_Size_Grp'].value_counts().index,
y=df_all['Family_Size_Grp'].value_counts(),
ax=axes[0])
axes[0].set_title("Family Size Distribution", fontsize=15)
sns.countplot(x='Family_Size_Grp', hue='Survived', data=df_all, ax=axes[1], palette=sns.diverging_palette(10, 255, sep=80, n=2))
axes[1].legend(['Not Survived', 'Survived'], loc='upper right')
axes[1].set_title("Survival Rate based on Family Size", fontsize=15);
Based on these observations, we have grouped the family sizes into four categories. However, this new feature does not account for people who are not related but travel together as a group.
6.3 Tickets
We have not used the ticket id yet. If we simply group the tickets, we may find something interesting.
df_all.groupby('Ticket').size().sort_values(ascending=False)
Ticket
CA. 2343 11
1601 8
CA 2144 8
S.O.C. 14879 7
PC 17608 7
..
349248 1
349247 1
349246 1
349245 1
345769 1
Length: 929, dtype: int64
Indeed, tickets appear to have been assigned to groups of people rather than being unique per passenger. Let's replace the raw ticket ids with their frequency, as the ids themselves do not seem to contain a pattern.
df_all['Ticket_Frequency'] = df_all.groupby('Ticket')['Ticket'].transform('count')
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
sns.barplot(x=df_all['Ticket_Frequency'].value_counts().index,
y=df_all['Ticket_Frequency'].value_counts(),
ax=axes[0])
axes[0].set_title("Ticket Frequency Distribution", fontsize=15)
sns.countplot(x='Ticket_Frequency', hue='Survived', data=df_all, ax=axes[1], palette=sns.diverging_palette(10, 255, sep=80, n=2))
axes[1].legend(['Not Survived', 'Survived'], loc='upper right')
axes[1].set_title("Survival Rate based on Ticket Frequency", fontsize=15);
Similar to the family size, single individuals tend to die more often than members of groups of 2, 3, or 4. However, larger groups suffer more casualties.
6.4 Family Survival Rates
Based on the passenger last names, we can construct statistics related to individual families.
df_all['Last_Name'] = df_all['Name'].str.extract(r'^([a-zA-Z\s\-\']+)\,*')
If we want to compute the survival rate per family, we need to restrict our dataset to the training set, since the test set does not contain the target feature.
df_train, df_test = divide_df(df_all)
In order to generate our new features, we need to perform the following steps:
- Find last names present in both the training and test sets.
- Compute median survival rate for each ticket and family.
- Keep survival rates for families present in both sets with more than one member; do the same for tickets.
- Combine the ticket and family survival rates.
# 1. families and tickets occurring in both sets
non_unique_families = df_train.loc[df_train['Last_Name'].isin(df_test['Last_Name'].unique()), 'Last_Name']
non_unique_tickets = df_train.loc[df_train['Ticket'].isin(df_test['Ticket'].unique()), 'Ticket']
# 2. compute median survival rate and size for each ticket and family
df_family = df_train.groupby('Last_Name').median()[['Survived', 'Family_Size']]
df_ticket = df_train.groupby('Ticket').median()[['Survived', 'Ticket_Frequency']]
# 3. keep families with more than one member that are present in both the train and test sets
df_family = df_family.loc[(df_family.index.isin(non_unique_families)) & (df_family['Family_Size']>1)][['Survived']]
df_ticket = df_ticket.loc[(df_ticket.index.isin(non_unique_tickets)) & (df_ticket['Ticket_Frequency']>1)][['Survived']]
# overall mean survival rate (used as a default for passengers without a group rate)
mean_survival_rate = np.mean(df_train['Survived'])
# 4. assign family survival rate to each passenger, same for tickets
df_train = df_train.merge(right=df_family, how='left', left_on='Last_Name',right_index=True, suffixes=('', '_y'))
df_train = df_train.rename(columns={"Survived_y": "Family_Survival"})
df_train = df_train.merge(right=df_ticket, how='left', left_on='Ticket',right_index=True, suffixes=('', '_y'))
df_train = df_train.rename(columns={"Survived_y": "Ticket_Survival"})
df_test = df_test.merge(right=df_family, how='left', left_on='Last_Name', right_index=True, suffixes=('', '_y'))
df_test = df_test.rename(columns={"Survived": "Family_Survival"})
df_test = df_test.merge(right=df_ticket, how='left', left_on='Ticket', right_index=True, suffixes=('', '_y'))
df_test = df_test.rename(columns={"Survived": "Ticket_Survival"})
# new feature to determine if a family has a family-based survival rate, same for tickets
df_train['Has_Family_Survival'] = (~df_train['Family_Survival'].isnull()).astype(int)
df_test['Has_Family_Survival'] = (~df_test['Family_Survival'].isnull()).astype(int)
df_train['Has_Ticket_Survival'] = (~df_train['Ticket_Survival'].isnull()).astype(int)
df_test['Has_Ticket_Survival'] = (~df_test['Ticket_Survival'].isnull()).astype(int)
# fill null with mean survival rate, same for tickets
df_train['Family_Survival'] = df_train['Family_Survival'].fillna(mean_survival_rate)
df_test['Family_Survival'] = df_test['Family_Survival'].fillna(mean_survival_rate)
df_train['Ticket_Survival'] = df_train['Ticket_Survival'].fillna(mean_survival_rate)
df_test['Ticket_Survival'] = df_test['Ticket_Survival'].fillna(mean_survival_rate)
# 4. Combine survival rates (Family and Ticket)
for df in [df_train, df_test]:
df['Survival_Rate'] = (df['Ticket_Survival'] + df['Family_Survival']) / 2
df['Has_Survival'] = (df['Has_Ticket_Survival'] + df['Has_Family_Survival']) / 2
6.5 Fare Binning
In order to improve our predictions on unseen data, it is common to bin continuous features. Therefore, we bin the fare using quantiles. We select the number of bins with the intent of creating bins that are as pure as possible.
df_all = concat_df(df_train, df_test)
df_all['Fare_Bin'] = pd.qcut(df_all['Fare'], 13)
fig, axs = plt.subplots(figsize=(16, 6))
sns.countplot(x='Fare_Bin', hue='Survived', data=df_all, palette=sns.diverging_palette(10, 255, sep=80, n=2))
plt.xlabel('Fare Bins')
plt.ylabel('Passenger Count')
plt.legend(['Not Survived', 'Survived'], loc='upper right')
plt.title('Count of Survival in {} Feature'.format('Fare Bin'), size=15)
plt.tight_layout()
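The choice of 13 bins is a judgment call. One rough way to compare candidate bin counts is to look at how much the per-bin survival rates spread out (purer bins sit further from the overall average); a sketch on the training rows, assuming df_all as defined above:
# compare candidate quantile-bin counts by the spread of per-bin survival rates
train_part = df_all[df_all['Survived'].notnull()]
for n_bins in [5, 10, 13, 15]:
    bins = pd.qcut(train_part['Fare'], n_bins, duplicates='drop')
    bin_rates = train_part.groupby(bins)['Survived'].mean()
    print("{} bins -> std of per-bin survival rates: {:.3f}".format(n_bins, bin_rates.std()))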
6.6 Age Binning
Similar to the Fare feature, we bin the age feature.
df_all['Age_Bin'] = pd.qcut(df_all['Age'], 10)
fig, axs = plt.subplots(figsize=(16, 6))
sns.countplot(x='Age_Bin', hue='Survived', data=df_all, palette=sns.diverging_palette(10, 255, sep=80, n=2))
plt.xlabel('Age Bins')
plt.ylabel('Passenger Count')
plt.legend(['Not Survived', 'Survived'], loc='upper right')
plt.title('Count of Survival in {} Feature'.format('Age Bin'), size=15)
plt.tight_layout()
6.7 Encoding
In order for our model to interpret categorical and non-numerical features, we need to encode them and generate a new feature for each possible label.
The following features need to be encoded:
- Embarked
- Sex
- Deck
- Family_Size_Grp
- Age_Bin
- Fare_Bin
- Title
df_train, df_test = divide_df(df_all)
# features to be encoded
encoding = ['Embarked', 'Sex', 'Deck', 'Family_Size_Grp', 'Age_Bin', 'Fare_Bin', 'Title']
for feature in encoding:
encoder = LabelEncoder()
df_train[feature] = encoder.fit_transform(df_train[feature])
df_test[feature] = encoder.transform(df_test[feature])
Once the features have been encoded, we can create dummies.
dummies = ['Pclass', 'Sex', 'Deck', 'Embarked', 'Title', 'Family_Size_Grp']
df_train = pd.get_dummies(df_train, prefix_sep='_', columns=dummies, drop_first=False)
df_test = pd.get_dummies(df_test, prefix_sep='_', columns=dummies, drop_first=False)
Finally, we can drop several columns that we do not need.
df_all = concat_df(df_train, df_test)
drop_columns = ['Family_Size', 'Survived', 'Name', 'Parch', 'Last_Name', 'PassengerId', 'SibSp',
'Ticket', 'Ticket_Survival', 'Family_Survival', 'Has_Family_Survival', 'Has_Ticket_Survival',
'Age', 'Fare']
drop_columns_test = ['Family_Size', 'Name', 'Parch', 'Last_Name', 'PassengerId', 'SibSp',
'Ticket', 'Ticket_Survival', 'Family_Survival', 'Has_Family_Survival', 'Has_Ticket_Survival',
'Age', 'Fare']
df_all = df_all.drop(columns=drop_columns)
df_all.head()
Age_Bin | Deck_0 | Deck_1 | Deck_2 | Deck_3 | Embarked_0 | Embarked_1 | Embarked_2 | Family_Size_Grp_0 | Family_Size_Grp_1 | ... | Pclass_2 | Pclass_3 | Sex_0 | Sex_1 | Survival_Rate | Ticket_Frequency | Title_0 | Title_1 | Title_2 | Title_3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0.383838 | 1 | 0 | 0 | 1 | 0 |
1 | 7 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1.000000 | 2 | 0 | 1 | 0 | 0 |
2 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 1 | 1 | 0 | 0.383838 | 1 | 0 | 1 | 0 | 0 |
3 | 7 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0.383838 | 2 | 0 | 1 | 0 | 0 |
4 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 1 | 0 | 1 | 0.383838 | 1 | 0 | 0 | 1 | 0 |
5 rows × 26 columns
df_all.columns
Index(['Age_Bin', 'Deck_0', 'Deck_1', 'Deck_2', 'Deck_3', 'Embarked_0',
'Embarked_1', 'Embarked_2', 'Family_Size_Grp_0', 'Family_Size_Grp_1',
'Family_Size_Grp_2', 'Family_Size_Grp_3', 'Fare_Bin', 'Has_Survival',
'Is_Married', 'Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_0', 'Sex_1',
'Survival_Rate', 'Ticket_Frequency', 'Title_0', 'Title_1', 'Title_2',
'Title_3'],
dtype='object')
7. Model Preparation
We will now prepare the data before creating a model. The preparation is divided into two steps:
- Separate the dataframes into our input data and our target feature (X and y).
- Normalize the data.
X_train = StandardScaler().fit_transform(df_train.drop(columns=drop_columns))
y_train = df_train['Survived'].values
X_test = StandardScaler().fit_transform(df_test.drop(columns=drop_columns_test))
print('X_train shape: {}'.format(X_train.shape))
print('y_train shape: {}'.format(y_train.shape))
print('X_test shape: {}'.format(X_test.shape))
X_train shape: (891, 26)
y_train shape: (891,)
X_test shape: (418, 26)
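Note that X_test above is standardized with its own StandardScaler. A common alternative is to fit the scaler on the training features only and reuse it on the test features, so both matrices share the same scale; a sketch with the names defined above (the original scaling is kept for the results below):
# alternative: fit the scaler on the training features only, then reuse it on the test features
scaler = StandardScaler().fit(df_train.drop(columns=drop_columns))
X_train_alt = scaler.transform(df_train.drop(columns=drop_columns))
X_test_alt = scaler.transform(df_test.drop(columns=drop_columns_test))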
# PCA
pca = PCA(random_state=42)
X_pca = pca.fit_transform(X_train)
fig, ax = plt.subplots(figsize=(12,6))
ax.plot(np.cumsum(pca.explained_variance_ratio_))
ax.set_xlabel("Component")
ax.set_ylabel("Percent of Explained Variance")
ax.set_title("Cumulative Variance", fontsize=15)
ax.set_ylim(0,1)
ax.set_xlim(0,pca.n_components_);
fig, ax = plt.subplots(figsize=(16,16))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
cset = ax.imshow(
pca.components_.T,
cmap = cmap,
vmin=-1,
vmax=1)
ylabels = df_train.drop(columns=drop_columns).columns.values.tolist()
ax.set_yticks(np.arange(len(ylabels)))
ax.set_yticklabels(ylabels, rotation=0)
ax.set_ylim(-0.5,len(ylabels)-0.5)
ax.set_xticks(range(pca.n_components_))
ax.grid(False)
ax.set_xlabel("Principal Component", fontsize=15)
ax.set_ylabel("Contribution", fontsize=15)
ax.set_title("Contribution of Features to Components", fontsize=15)
fig.colorbar(cset, ax=ax, shrink=0.5)
plt.tight_layout();
8. Models
In this section, we will make predictions using the following models:
- RandomForestClassifier
- ExtraTreesClassifier
- LogisticRegression
- GradientBoostingClassifier
- LinearDiscriminantAnalysis
- RidgeClassifier
- XGBClassifier
- MLPClassifier
- BaggingClassifier
- BernoulliNB
- ExtraTreeClassifier
- DecisionTreeClassifier
- LinearSVC
- AdaBoostClassifier
- SVC
- NuSVC
- SGDClassifier
- Perceptron
- GaussianProcessClassifier
- KNeighborsClassifier
- GaussianNB
- PassiveAggressiveClassifier
- QuadraticDiscriminantAnalysis
# Cross-validate models with stratified k-fold cross-validation
kfold = StratifiedKFold(n_splits=5, random_state=5, shuffle=True)
random_state = 42
# classifiers
classifiers_list = [
#Ensemble Methods
AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state), random_state=random_state),
BaggingClassifier(random_state=random_state),
ExtraTreesClassifier(random_state=random_state),
GradientBoostingClassifier(random_state=random_state),
RandomForestClassifier(random_state=random_state),
#Gaussian Processes
GaussianProcessClassifier(random_state=random_state),
#GLM
LogisticRegression(random_state=random_state),
PassiveAggressiveClassifier(random_state=random_state),
RidgeClassifier(),
SGDClassifier(random_state=random_state),
Perceptron(random_state=random_state),
MLPClassifier(random_state=random_state),
#Naive Bayes
BernoulliNB(),
GaussianNB(),
#Nearest Neighbor
KNeighborsClassifier(),
#SVM
SVC(probability=True, random_state=random_state),
NuSVC(probability=True, random_state=random_state),
LinearSVC(random_state=random_state),
#Trees
DecisionTreeClassifier(random_state=random_state),
ExtraTreeClassifier(random_state=random_state),
#Discriminant Analysis
LinearDiscriminantAnalysis(),
QuadraticDiscriminantAnalysis(),
#xgboost: http://xgboost.readthedocs.io/en/latest/model.html
XGBClassifier()
]
# store cv results in list
cv_results_list = []
cv_means_list = []
cv_std_list = []
# perform cross-validation
for clf in tqdm(classifiers_list):
cv_results_list.append(cross_val_score(clf,
X_train,
y_train,
scoring = "accuracy",
cv = kfold,
n_jobs=-1))
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
# store mean and std accuracy
for cv_result in cv_results_list:
cv_means_list.append(cv_result.mean())
cv_std_list.append(cv_result.std())
cv_res_df = pd.DataFrame({"CrossValMeans":cv_means_list,
"CrossValerrors": cv_std_list,
"Algorithm":[clf.__class__.__name__ for clf in classifiers_list]})
cv_res_df = cv_res_df.sort_values(by='CrossValMeans',ascending=False)
100%|██████████████████████████████████████████████████████████████████████████████████| 23/23 [00:20<00:00, 1.12it/s]
cv_res_df.set_index('Algorithm')
CrossValMeans | CrossValerrors | |
---|---|---|
Algorithm | ||
LogisticRegression | 0.853035 | 0.035074 |
GradientBoostingClassifier | 0.853010 | 0.028286 |
LinearDiscriminantAnalysis | 0.851893 | 0.026919 |
RidgeClassifier | 0.851893 | 0.026919 |
LinearSVC | 0.850781 | 0.030352 |
MLPClassifier | 0.850769 | 0.026927 |
XGBClassifier | 0.850769 | 0.027849 |
NuSVC | 0.840675 | 0.028222 |
SVC | 0.840669 | 0.028718 |
SGDClassifier | 0.837261 | 0.028204 |
GaussianProcessClassifier | 0.835057 | 0.026180 |
RandomForestClassifier | 0.831674 | 0.023097 |
KNeighborsClassifier | 0.829452 | 0.027523 |
BaggingClassifier | 0.823828 | 0.026461 |
ExtraTreesClassifier | 0.821581 | 0.026906 |
AdaBoostClassifier | 0.818204 | 0.028622 |
DecisionTreeClassifier | 0.810345 | 0.019053 |
GaussianNB | 0.808123 | 0.032538 |
ExtraTreeClassifier | 0.805850 | 0.016329 |
BernoulliNB | 0.795750 | 0.011812 |
Perceptron | 0.766675 | 0.060092 |
QuadraticDiscriminantAnalysis | 0.754234 | 0.024949 |
PassiveAggressiveClassifier | 0.727224 | 0.068009 |
# plot results
fig, ax = plt.subplots(figsize=(12,12))
g = sns.barplot("CrossValMeans",
"Algorithm",
data = cv_res_df,
palette="Set3",
orient = "h",
**{'xerr':cv_std_list})
g.set_xlabel("Mean Accuracy")
g = g.set_title("Cross validation scores")
# store best models
best_models = []
def optimize_model(model, X_train, y_train, paramgrid, random_state=10, suffix='_best',
metric = 'accuracy', kfold=StratifiedKFold(n_splits=5, random_state=5, shuffle=True), verbose=1, other_args={}, print_best=True):
# adjust base estimator parameters
estimator = model(**other_args)
# create k-fold-based grid-search
gridsearch = GridSearchCV(estimator, param_grid=paramgrid, cv=kfold, scoring=metric, n_jobs=-1, verbose=verbose)
# fit grid search
gridsearch.fit(X_train, y_train)
# store best model
name = model.__name__ + suffix
best_models.append((name, gridsearch.best_estimator_))
# print (optional)
if print_best:
print(gridsearch)
# display accuracy
    print('Best {} model achieves {:.2f}% {}'.format(model.__name__, 100 * gridsearch.best_score_, metric))
param_grid = {
'max_depth': [3, 5, 6, 7],
    'learning_rate': [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2],
'n_estimators': [500, 1000, 1500],
}
optimize_model(XGBClassifier, X_train, y_train, paramgrid=param_grid)
Fitting 5 folds for each of 108 candidates, totalling 540 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 15.3s
[Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 1.3min
[Parallel(n_jobs=-1)]: Done 434 tasks | elapsed: 2.9min
[Parallel(n_jobs=-1)]: Done 540 out of 540 | elapsed: 3.5min finished
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=5, shuffle=True),
error_score=nan,
estimator=XGBClassifier(base_score=0.5, booster='gbtree',
colsample_bylevel=1, colsample_bynode=1,
colsample_bytree=1, gamma=0,
learning_rate=0.1, max_delta_step=0,
max_depth=3, min_child_weight=1,
missing=None, n_estimators=100, n_jobs=1,
nthread=None, objective='binary:logistic',
random_state=0, reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=None, silent=None,
subsample=1, verbosity=1),
iid='deprecated', n_jobs=-1,
param_grid={'learning_rate': [0.01, 0.025, 0.05, 0.075, 0, 1, 0,
15, 0.2],
'max_depth': [3, 5, 6, 7],
'n_estimators': [500, 1000, 1500]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring='accuracy', verbose=1)
Best XGBClassifier model achieves 85.64% accuracy
param_grid = {
"max_depth": [None],
"max_features": [5, 7, 'auto'],
"min_samples_split": [4, 5, 6],
"min_samples_leaf": [4, 5, 6],
"bootstrap": [False],
"n_estimators": [1500, 2000],
"criterion": ["gini"]
}
optimize_model(ExtraTreesClassifier, X_train, y_train, paramgrid=param_grid)
Fitting 5 folds for each of 54 candidates, totalling 270 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 18.5s
[Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 1.5min
[Parallel(n_jobs=-1)]: Done 270 out of 270 | elapsed: 2.3min finished
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=5, shuffle=True),
error_score=nan,
estimator=ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0,
class_weight=None, criterion='gini',
max_depth=None, max_features='auto',
max_leaf_nodes=None,
max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_w...
oob_score=False, random_state=None,
verbose=0, warm_start=False),
iid='deprecated', n_jobs=-1,
param_grid={'bootstrap': [False], 'criterion': ['gini'],
'max_depth': [None], 'max_features': [5, 7, 'auto'],
'min_samples_leaf': [4, 5, 6],
'min_samples_split': [4, 5, 6],
'n_estimators': [1500, 2000]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring='accuracy', verbose=1)
Best ExtraTreesClassifier model achieves 83.96% accuracy
param_grid = {
"max_depth": [5, 6, 7],
"min_samples_split": [4, 5, 6],
"min_samples_leaf": [4, 5, 6],
"max_features": [5, 7, 'auto'],
"n_estimators": [1750],
"criterion": ["gini"]
}
optimize_model(RandomForestClassifier, X_train, y_train, paramgrid=param_grid)
Fitting 5 folds for each of 81 candidates, totalling 405 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 29.4s
[Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 2.0min
[Parallel(n_jobs=-1)]: Done 405 out of 405 | elapsed: 4.3min finished
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=5, shuffle=True),
error_score=nan,
estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
class_weight=None,
criterion='gini', max_depth=None,
max_features='auto',
max_leaf_nodes=None,
max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_...
n_estimators=100, n_jobs=None,
oob_score=False,
random_state=None, verbose=0,
warm_start=False),
iid='deprecated', n_jobs=-1,
param_grid={'criterion': ['gini'], 'max_depth': [5, 6, 7],
'max_features': [5, 7, 'auto'],
'min_samples_leaf': [4, 5, 6],
'min_samples_split': [4, 5, 6],
'n_estimators': [1750]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring='accuracy', verbose=1)
Best RandomForestClassifier model achieves 84.74% accuracy
param_grid = {
'kernel': ['rbf'],
'gamma': [0.0005, 0.0008, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1],
'C': [1, 10, 50, 100, 200, 250, 500]
}
optimize_model(SVC, X_train, y_train, paramgrid=param_grid, other_args={'probability':True})
Fitting 5 folds for each of 63 candidates, totalling 315 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 52 tasks | elapsed: 0.8s
[Parallel(n_jobs=-1)]: Done 300 out of 315 | elapsed: 9.5s remaining: 0.4s
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=5, shuffle=True),
error_score=nan,
estimator=SVC(C=1.0, break_ties=False, cache_size=200,
class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3,
gamma='scale', kernel='rbf', max_iter=-1,
probability=True, random_state=None, shrinking=True,
tol=0.001, verbose=False),
iid='deprecated', n_jobs=-1,
param_grid={'C': [1, 10, 50, 100, 200, 250, 500],
'gamma': [0.0005, 0.0008, 0.001, 0.005, 0.01, 0.05,
0.1, 0.5, 1],
'kernel': ['rbf']},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring='accuracy', verbose=1)
Best SVC model achieves 85.42% accuracy
[Parallel(n_jobs=-1)]: Done 315 out of 315 | elapsed: 10.2s finished
param_grid = {"C": np.logspace(-3, 3, 40), "penalty": ["l1", "l2"]}
optimize_model(LogisticRegression, X_train, y_train, paramgrid=param_grid)
Fitting 5 folds for each of 80 candidates, totalling 400 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done 56 tasks | elapsed: 0.0s
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=5, shuffle=True),
error_score=nan,
estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
fit_intercept=True,
intercept_scaling=1, l1_ratio=None,
max_iter=100, multi_class='auto',
n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs',
tol=0.0001, verbose=0,
warm_start=False),
iid='deprecated'...
4.92388263e+00, 7.01703829e+00, 1.00000000e+01, 1.42510267e+01,
2.03091762e+01, 2.89426612e+01, 4.12462638e+01, 5.87801607e+01,
8.37677640e+01, 1.19377664e+02, 1.70125428e+02, 2.42446202e+02,
3.45510729e+02, 4.92388263e+02, 7.01703829e+02, 1.00000000e+03]),
'penalty': ['l1', 'l2']},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring='accuracy', verbose=1)
Best LogisticRegression model achieves 85.64% accuracy
[Parallel(n_jobs=-1)]: Done 400 out of 400 | elapsed: 0.3s finished
param_grid = {
"learning_rate": [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2],
"subsample": [0.5, 0.62, 0.8, 0.85, 0.87, 0.9, 0.95, 1.0]
}
optimize_model(GradientBoostingClassifier,
X_train,
y_train,
paramgrid=param_grid)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
Fitting 5 folds for each of 56 candidates, totalling 280 fits
[Parallel(n_jobs=-1)]: Done 52 tasks | elapsed: 1.5s
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=5, shuffle=True),
error_score=nan,
estimator=GradientBoostingClassifier(ccp_alpha=0.0,
criterion='friedman_mse',
init=None, learning_rate=0.1,
loss='deviance', max_depth=3,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min...
n_iter_no_change=None,
presort='deprecated',
random_state=None,
subsample=1.0, tol=0.0001,
validation_fraction=0.1,
verbose=0, warm_start=False),
iid='deprecated', n_jobs=-1,
param_grid={'learning_rate': [0.01, 0.025, 0.05, 0.075, 0.1, 0.15,
0.2],
'subsample': [0.5, 0.62, 0.8, 0.85, 0.87, 0.9, 0.95,
1.0]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring='accuracy', verbose=1)
Best GradientBoostingClassifier model achieves 86.20% accuracy
[Parallel(n_jobs=-1)]: Done 280 out of 280 | elapsed: 6.8s finished
param_grid = {
"n_neighbors": [3, 5, 7, 11, 13, 15],
"weights": ['uniform', 'distance'],
"metric": ['manhattan'],
"algorithm":['auto']
}
optimize_model(KNeighborsClassifier,
X_train,
y_train,
paramgrid=param_grid)
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=5, shuffle=True),
error_score='raise-deprecating',
estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
metric='minkowski',
metric_params=None, n_jobs=None,
n_neighbors=5, p=2,
weights='uniform'),
iid='warn', n_jobs=-1,
param_grid={'algorithm': ['auto'], 'metric': ['manhattan'],
'n_neighbors': [3, 5, 7, 11, 13, 15],
'weights': ['uniform', 'distance']},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring='accuracy', verbose=1)
Best KNeighborsClassifier model achieves 83.95% accuracy
[Parallel(n_jobs=-1)]: Done 60 out of 60 | elapsed: 4.1s finished
LDA = LinearDiscriminantAnalysis()
LDA.fit(X_train, y_train)
LDA_best = LDA
best_models.append(("LDA_best", LDA_best))
predictions = {}
N_fold = 5
def generalize_predictions(model,
X_train,
y_train,
X_test,
skf=StratifiedKFold(n_splits=N_fold,
random_state=N_fold,
shuffle=True)):
# store predicted probas for each folds
probas = pd.DataFrame(np.zeros((len(X_test), N_fold * 2)),
columns=[
'Fold_{}_Class_{}'.format(i, j)
for i in range(1, N_fold + 1) for j in range(2)
],
index=passengerId)
# for each fold, train then predict
for fold_id, (train_idx,
val_idx) in tqdm(enumerate(skf.split(X_train, y_train), 1)):
# Fitting the model
model.fit(X_train[train_idx], y_train[train_idx])
# X_test probabilities
probas.loc[:, 'Fold_{}_Class_0'.format(fold_id)] = model.predict_proba(
X_test)[:, 0]
probas.loc[:, 'Fold_{}_Class_1'.format(fold_id)] = model.predict_proba(
X_test)[:, 1]
# save results
return (probas)
predictions = {name: generalize_predictions(model, X_train, y_train, X_test) for name, model in best_models}
5it [00:07, 1.49s/it]
5it [00:13, 2.70s/it]
5it [00:13, 2.78s/it]
5it [00:01, 3.91it/s]
5it [00:00, 40.15it/s]
5it [00:00, 6.12it/s]
5it [00:00, 18.64it/s]
5it [00:00, 116.72it/s]
def make_predictions(model_df, mode="hard", threshold=0.5):
predictions = pd.DataFrame(np.zeros((len(X_test), 3)),
columns=['1', '0', 'pred'])
# isolate probabilities of class 1
class_one = [col for col in model_df.columns if col.endswith('Class_1')]
# compute average of class 1 probabilities
predictions['1'] = model_df[class_one].sum(axis=1) / N_fold
predictions['0'] = 1. - predictions['1']
if mode=="hard":
predictions['pred'] = (predictions['1'] >= threshold).astype(int)
else:
predictions['pred'] = predictions['1']
model_results = pd.DataFrame(columns=['PassengerId', 'Survived'])
model_results['PassengerId'] = df_test['PassengerId']
model_results['Survived'] = predictions['pred']
return model_results
results = [(key, make_predictions(model)) for key, model in predictions.items()]
9. Best Models
for name, model in best_models:
print(name)
XGBClassifier_best
ExtraTreesClassifier_best
RandomForestClassifier_best
SVC_best
LogisticRegression_best
GradientBoostingClassifier_best
KNeighborsClassifier_best
LDA_best
predictions = []
for model_name, model in tqdm(best_models):
# make predictions
y_pred = model.predict(X_test)
# store predictions
predictions.append(pd.Series(y_pred, name=model_name))
# concatenate predictions
ensemble_results = pd.concat(predictions, axis=1)
100%|██████████| 8/8 [00:00<00:00, 16.13it/s]
# Generate a mask for the upper triangle
mask = np.zeros_like(ensemble_results.corr(), dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
fig, ax = plt.subplots(figsize=(12, 10))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
g = sns.heatmap(ensemble_results.corr(),
annot=True,
mask=mask,
annot_kws={"fontsize": 14}, cmap=cmap)
ylabels = [x.get_text() for x in g.get_yticklabels()]
plt.yticks(np.arange(len(ylabels)) + 0.5,
ylabels,
rotation=0,
fontsize="14",
va="center")
g.set_ylim(len(best_models), 0.5);
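Checking how correlated the tuned models' predictions are is usually a prelude to combining them: models that disagree add more value to an ensemble. As a sketch (the VotingClassifier import above is otherwise unused; this ensemble is not the one submitted below), the tuned models could be combined with soft voting:
# soft-voting ensemble over the tuned models (soft voting requires predict_proba)
voting = VotingClassifier(estimators=[(name, model) for name, model in best_models
                                      if hasattr(model, 'predict_proba')],
                          voting='soft', n_jobs=-1)
voting_cv = cross_val_score(voting, X_train, y_train, scoring='accuracy', cv=kfold, n_jobs=-1)
print("Voting ensemble CV accuracy: {:.2f}% (+/- {:.2f}%)".format(
    voting_cv.mean() * 100, voting_cv.std() * 100))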
def plot_learning_curve(models,
X,
y,
ylim=None,
cv=None,
n_jobs=-1,
train_sizes=np.linspace(.1, 1.0, 5)):
# extract number of models
n_models = len(models)
# create figure
fix, axes = plt.subplots(n_models, 1, figsize=(8, 5 * n_models))
for idx, val in enumerate(models):
# unpack
name, model = val
# scale y axis
if ylim is not None: axes[idx].set_ylim(*ylim)
# set title
        axes[idx].set_title(name + " learning curves")
# set labels
axes[idx].set_xlabel("Training size")
axes[idx].set_ylabel("Score")
# compute learning curves
train_sizes, train_scores, test_scores = learning_curve(
model, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
# compute statistics
train_score_mean = np.mean(train_scores, axis=1)
train_score_std = np.std(train_scores, axis=1)
test_score_mean = np.mean(test_scores, axis=1)
test_score_std = np.std(test_scores, axis=1)
axes[idx].fill_between(train_sizes,
train_score_mean - train_score_std,
train_score_mean + train_score_std,
alpha=0.1,
color='r')
axes[idx].fill_between(train_sizes,
test_score_mean - test_score_std,
test_score_mean + test_score_std,
alpha=0.1,
color='g')
axes[idx].plot(train_sizes,
train_score_mean,
'o-',
color="r",
label="Training score")
axes[idx].plot(train_sizes,
test_score_mean,
'o-',
color="g",
label="Cross-validation score")
axes[idx].legend(loc='best')
plt.subplots_adjust(hspace=0.3)
return fig, axes
plot_learning_curve(best_models,X_train,y_train,cv=kfold, ylim=(0.6,1.));
nrows = ncols = 2
fig, axes = plt.subplots(nrows = nrows, ncols = ncols, sharex="all", figsize=(15,15))
names_classifiers = [("XGBClassifier_best",best_models[0][1]),
("ExtraTreesClassifier_best",best_models[1][1]),
("RandomForestClassifier_best",best_models[2][1]),
("GradientBoostingClassifier_best",best_models[5][1])]
nclassifier = 0
for row in range(nrows):
for col in range(ncols):
name = names_classifiers[nclassifier][0]
classifier = names_classifiers[nclassifier][1]
indices = np.argsort(classifier.feature_importances_)[::-1][:40]
        g = sns.barplot(y=df_train.drop(columns=drop_columns).columns[indices][:40], x=classifier.feature_importances_[indices][:40], orient='h', ax=axes[row][col])
g.set_xlabel("Relative importance",fontsize=12)
g.set_ylabel("Features",fontsize=12)
g.tick_params(labelsize=9)
g.set_title(name + " feature importance")
nclassifier += 1
plt.tight_layout()
10. Create Submission
# generate submission file from the fold-averaged RandomForestClassifier predictions (results[2])
submission_df = pd.DataFrame({'PassengerId': passengerId,
'Survived': results[2][1].values.T[0]})
submission_df.to_csv("voting_submission_df.csv", index=False)
The submission leads to an 81.339% accuracy. This puts the predictions in the top 3% of the Kaggle leaderboard.