This is originally a Kaggle problem where the task is to build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).
The data is broken into train and test sets in a ratio of approximately 68 to 32 percent respectively with the following features:
- survival: whether the passenger survived; 0 = No, 1 = Yes. This is what we are trying to predict
- pclass: ticket class; 1 = 1st, 2 = 2nd, 3 = 3rd
- sex: the gender of the passenger (male/female)
- age: age in years
- sibsp: number of siblings / spouses aboard the Titanic
- parch: number of parents / children aboard the Titanic
- ticket: ticket number
- fare: passenger fare
- cabin: cabin number
- embarked: port of embarkation (where the person boarded); C = Cherbourg, Q = Queenstown, S = Southampton
Getting Started
First we need to load the training data to see what we have…
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# note the raw... this is the untouched data.
rawTrainingData = pd.read_csv ('train.csv')
rawTestData = pd.read_csv ('test.csv')
# view the first few lines
rawTrainingData.head ()
Next we check the range of values:
rawTrainingData.describe ()
A few things to note: the average survival rate was 38.4% and the average ticket price was 32 pounds.
Check whether any data is missing…
rawTrainingData.info ()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
rawTestData.info ()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
Most columns look fine, except that Age, Cabin, Fare, and Embarked are missing some values (Age, Cabin, and Embarked in the training set; Age, Cabin, and Fare in the test set).
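As a quick cross-check (a small aside, not strictly needed), the missing counts can also be listed directly:
rawTrainingData.isna ().sum ()
rawTestData.isna ().sum ()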
Looking at the Data…
Cabin
rawTrainingData['Cabin'].value_counts ()
G6 4
B96 B98 4
C23 C25 C27 4
E101 3
F2 3
..
B94 1
E50 1
C111 1
C90 1
D47 1
Name: Cabin, Length: 147, dtype: int64
There are a lot of missing Cabin values, and there is no noticeable pattern among the ones that are present. Hence this column will be dropped from the dataset.
Age
First plot the age distribution…
plt.figure ()
plt.hist (rawTrainingData['Age'])
plt.show ()
Since the age distribution isn’t a nice normal distribution, we shall use the median (rather than the mean) to fill in the missing values.
rawTrainingData['Age'].median ()
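The actual filling is done later inside the transform function; as a one-line sketch of what it amounts to:
rawTrainingData['Age'].fillna (rawTrainingData['Age'].median())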
Fare
First we shall plot the data…
plt.figure ()
plt.hist (rawTrainingData['Fare'], bins=30)
plt.show ()
From the distribution and some experiments, the fares can be put into three different ranges… low, medium, and high.
pd.cut (rawTrainingData['Fare'], [0, 10, 25, 1000]).value_counts()
(25, 1000] 334
(0, 10] 321
(10, 25] 221
Name: Fare, dtype: int64
Embarked
Look at what it consists of…
rawTrainingData['Embarked'].value_counts()
S 644
C 168
Q 77
Name: Embarked, dtype: int64
Look at the rows with the missing values to try to make sense of things…
rawTrainingData[rawTrainingData['Embarked'].isna()]
There is something odd here… both passengers had the same ticket number and stayed in the same cabin, yet one was French and married while the other was single and English. Perhaps it is best to drop these two rows.
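Dropping them is handled later inside the transform function; as a standalone sketch:
rawTrainingData.dropna (subset=['Embarked'])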
Sex
Looking at the values, we see that it is easy enough to convert them to numeric form…
rawTrainingData['Sex'].value_counts()
male 577
female 314
Name: Sex, dtype: int64
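With only two categories, the conversion could be a simple map, or (as in the transform function later) a one-hot encoding; a quick sketch of both:
rawTrainingData['Sex'].map ({'male': 0, 'female': 1})
pd.get_dummies (rawTrainingData['Sex'], prefix='Sex')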
Ticket
There seems to be some repetition in tickets…
rawTrainingData['Ticket'].value_counts()
1601 7
CA. 2343 7
347082 7
CA 2144 6
3101295 6
..
A.5. 18509 1
STON/O 2. 3101292 1
3101277 1
345778 1
PC 17475 1
Name: Ticket, Length: 681, dtype: int64
Hence, we will try to group the tickets together if possible using…
rawTrainingData.groupby("Ticket")["Ticket"].count()
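In the improved transform function below, this per-ticket count is attached back to each row using transform:
rawTrainingData.groupby('Ticket')['Ticket'].transform ('count')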
Name
From the Names we can extract some titles…
rawTrainingData['Name']
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
...
886 Montvila, Rev. Juozas
887 Graham, Miss. Margaret Edith
888 Johnston, Miss. Catherine Helen "Carrie"
889 Behr, Mr. Karl Howell
890 Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object
titles = {'Mr.': 0, 'Mrs.': 0, 'Miss.': 0, 'Master.': 0, 'Important': 0, 'Other': 0}
for name in rawTrainingData['Name']:
    nameTokens = name.split (' ')
    if 'Mr.' in nameTokens:
        titles['Mr.'] = titles['Mr.'] + 1
    elif 'Mrs.' in nameTokens or 'Mme.' in nameTokens or 'Ms.' in nameTokens:
        titles['Mrs.'] = titles['Mrs.'] + 1
    elif 'Miss.' in nameTokens or 'Mlle.' in nameTokens:
        titles['Miss.'] = titles['Miss.'] + 1
    elif 'Master.' in nameTokens:
        titles['Master.'] = titles['Master.'] + 1
    # note: 'Capt', 'Col', and 'Major' lack the trailing '.', so those ranks actually end up in 'Other'
    elif 'Dr.' in nameTokens or 'Capt' in nameTokens or 'Col' in nameTokens or \
         'Major' in nameTokens or 'Rev.' in nameTokens:
        titles['Important'] = titles['Important'] + 1
    else:  # Sir, Lady, Countess, Jonkheer, Don, and anything not matched above
        titles['Other'] = titles['Other'] + 1
titles
{'Mr.': 517,
'Mrs.': 127,
'Miss.': 184,
'Master.': 40,
'Important': 13,
'Other': 10}
Transforming the Data…
We shall make a function that will transform the data. It can be used for both the test and train sets. For now we are just one-hot encoding the categorical variables to get a baseline set of results.
def TransformData (dataset):
    # For now drop Cabin, Name, and Ticket... in the future we might try to use them
    transformedDataset = dataset.drop (['Cabin', 'Name', 'Ticket'], axis=1)
    # Drop rows with a missing Embarked value
    transformedDataset.dropna (axis=0, subset=['Embarked'], inplace=True)
    # Fill any missing ages with the median
    transformedDataset['Age'].fillna (transformedDataset['Age'].median(), inplace=True)
    # Fill any missing fares with the median
    transformedDataset['Fare'].fillna (transformedDataset['Fare'].median(), inplace=True)
    # One hot encode the Pclass column
    one_hot = pd.get_dummies (transformedDataset['Pclass'], prefix='PClass')
    transformedDataset.drop (['Pclass'], axis=1, inplace=True)
    transformedDataset = transformedDataset.join (one_hot)
    # One hot encode the Embarked column
    one_hot = pd.get_dummies (transformedDataset['Embarked'], prefix='Embarked')
    transformedDataset.drop (['Embarked'], axis=1, inplace=True)
    transformedDataset = transformedDataset.join (one_hot)
    # One hot encode the Sex column
    one_hot = pd.get_dummies (transformedDataset['Sex'], prefix='Sex')
    transformedDataset.drop (['Sex'], axis=1, inplace=True)
    transformedDataset = transformedDataset.join (one_hot)
    return transformedDataset
Modeling…
A few different models will be used:
- Support Vector Classifier
- Random Forest Classifier
- Logistic Regression
- Extreme Gradient Boosting (or XGBoost)
The idea is to try a few models and compare the results before settling on a final one.
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
To estimate the accuracy of each model, we will average the scores from cross_val_score:
from sklearn.model_selection import cross_val_score
The following is how the models are computed…
# make xTrain and yTrain data...
cleanedTrainingData = TransformData (rawTrainingData)
yTrain = cleanedTrainingData ['Survived']
xTrain = cleanedTrainingData.drop (['Survived', 'PassengerId'], axis=1)
svm_clf = SVC (gamma="auto")
svm_clf.fit (xTrain, yTrain)
svm_scores = cross_val_score(svm_clf, xTrain, yTrain, cv=10)
print (svm_scores.mean())
rf_clf = RandomForestClassifier (random_state=0)
rf_clf.fit (xTrain, yTrain)
rf_scores = cross_val_score (rf_clf, xTrain, yTrain, cv=10)
print (rf_scores.mean())
logit_clf = LogisticRegression(random_state=0, solver='lbfgs', max_iter=10000)
logit_clf.fit (xTrain, yTrain)
logit_scores = cross_val_score (logit_clf, xTrain, yTrain, cv=10)
print (logit_scores.mean())
model = xgb.XGBClassifier()
model.fit (xTrain, yTrain)
xgbScores = cross_val_score (model, xTrain, yTrain, cv=10)
print (xgbScores.mean())
Results so far… the mean cross-validation accuracy scores on the training data were:
- SVC: 73.2%
- Random Forest: 79.8%
- Logistic Regression: 79.9%
- XGBoost: 81.1%
Below are box plots of the results… the random forest, logistic regression, and XGBoost models all look like good candidates for improvement. Each box plot shows the ten accuracy scores obtained during cross-validation on the training data, along with the median and the first and third quartiles.
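A minimal sketch of how such box plots can be produced from the score arrays collected above (svm_scores, rf_scores, logit_scores, xgbScores):
plt.figure ()
plt.boxplot ([svm_scores, rf_scores, logit_scores, xgbScores],
             labels=['SVC', 'Random Forest', 'Logistic Regression', 'XGBoost'])
plt.ylabel ('Cross-validation accuracy')
plt.show ()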
Improving the Model…
Things are OK but can perhaps be improved. We shall try to enrich the data a bit…
- use the names to extract titles
- bin the fares and ages
- classify the number of people someone is travelling with
- use the ticket groupings
The new transform data function becomes:
def TransformData_V2 (dataset):
    # Drop Cabin for now... Name and Ticket are kept and turned into new features below
    transformedDataset = dataset.drop (['Cabin'], axis=1)
    # Extract a numeric title code from the name
    transformedDataset['Title'] = 0
    for i, name in enumerate (transformedDataset['Name']):
        nameTokens = name.split (' ')
        if 'Mr.' in nameTokens:
            transformedDataset.iloc [i, transformedDataset.columns.get_loc('Title')] = 1
        elif 'Mrs.' in nameTokens or 'Mme.' in nameTokens or 'Ms.' in nameTokens:
            transformedDataset.iloc [i, transformedDataset.columns.get_loc('Title')] = 2
        elif 'Miss.' in nameTokens or 'Mlle.' in nameTokens:
            transformedDataset.iloc [i, transformedDataset.columns.get_loc('Title')] = 3
        elif 'Master.' in nameTokens:
            transformedDataset.iloc [i, transformedDataset.columns.get_loc('Title')] = 4
        # note: 'Capt', 'Col', and 'Major' lack the trailing '.', so those ranks fall through to the catch-all below
        elif 'Dr.' in nameTokens or 'Capt' in nameTokens or 'Col' in nameTokens or \
             'Major' in nameTokens or 'Rev.' in nameTokens:
            transformedDataset.iloc [i, transformedDataset.columns.get_loc('Title')] = 5
        else:  # Sir, Lady, Countess, Jonkheer, Don, and anything not matched above
            transformedDataset.iloc [i, transformedDataset.columns.get_loc('Title')] = 6
    transformedDataset.drop (['Name'], axis=1, inplace=True)
    one_hot = pd.get_dummies (transformedDataset['Title'], prefix='Title')
    transformedDataset.drop (['Title'], axis=1, inplace=True)
    transformedDataset = transformedDataset.join (one_hot)
    # Count how many passengers share each ticket
    transformedDataset['TicketCount'] = transformedDataset.groupby('Ticket')['Ticket'].transform('count')
    transformedDataset.drop (['Ticket'], axis=1, inplace=True)
    # Family size features derived from SibSp and Parch
    transformedDataset['FamSize'] = transformedDataset['SibSp'] + transformedDataset['Parch']
    transformedDataset['Single'] = transformedDataset['FamSize'].map(lambda s: 1 if s == 1 else 0)
    transformedDataset['SmallFamily'] = transformedDataset['FamSize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
    transformedDataset['LargeFamily'] = transformedDataset['FamSize'].map(lambda s: 1 if 5 <= s else 0)
    transformedDataset = transformedDataset.drop (['SibSp', 'Parch'], axis=1)
    # Drop rows with a missing Embarked value
    transformedDataset.dropna (axis=0, subset=['Embarked'], inplace=True)
    # Fill any missing ages with the median, then bin into age groups
    transformedDataset['Age'].fillna (transformedDataset['Age'].median(), inplace=True)
    ageGroups = pd.cut (transformedDataset['Age'], bins=[0,15,30,45,60,100], labels=['A', 'B', 'C', 'D', 'E'])
    transformedDataset.insert (5, 'AgeGroup', ageGroups)
    one_hot = pd.get_dummies (transformedDataset['AgeGroup'], prefix='AgeGroup')
    transformedDataset.drop (['AgeGroup'], axis=1, inplace=True)
    transformedDataset = transformedDataset.join (one_hot)
    # Fill any missing fares with the median, then bin into fare groups
    transformedDataset['Fare'].fillna (transformedDataset['Fare'].median(), inplace=True)
    fareGroups = pd.cut (transformedDataset['Fare'], bins=[0,10,25,1000], labels=['A', 'B', 'C'])
    transformedDataset.insert (3, 'FareGroup', fareGroups)
    one_hot = pd.get_dummies (transformedDataset['FareGroup'], prefix='FareGroup')
    transformedDataset.drop (['FareGroup'], axis=1, inplace=True)
    transformedDataset = transformedDataset.join (one_hot)
    # One hot encode the Pclass column
    one_hot = pd.get_dummies (transformedDataset['Pclass'], prefix='PClass')
    transformedDataset.drop (['Pclass'], axis=1, inplace=True)
    transformedDataset = transformedDataset.join (one_hot)
    # One hot encode the Embarked column
    one_hot = pd.get_dummies (transformedDataset['Embarked'], prefix='Embarked')
    transformedDataset.drop (['Embarked'], axis=1, inplace=True)
    transformedDataset = transformedDataset.join (one_hot)
    # One hot encode the Sex column
    one_hot = pd.get_dummies (transformedDataset['Sex'], prefix='Sex')
    transformedDataset.drop (['Sex'], axis=1, inplace=True)
    transformedDataset = transformedDataset.join (one_hot)
    return transformedDataset
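To get the new scores, the earlier modeling code is simply re-run on the output of TransformData_V2; roughly (a sketch re-using the classifiers defined above):
cleanedTrainingData = TransformData_V2 (rawTrainingData)
yTrain = cleanedTrainingData ['Survived']
xTrain = cleanedTrainingData.drop (['Survived', 'PassengerId'], axis=1)
svm_scores = cross_val_score (svm_clf, xTrain, yTrain, cv=10)
rf_scores = cross_val_score (rf_clf, xTrain, yTrain, cv=10)
logit_scores = cross_val_score (logit_clf, xTrain, yTrain, cv=10)
xgbScores = cross_val_score (model, xTrain, yTrain, cv=10)  # model is the XGBClassifier from earlier
print (svm_scores.mean(), rf_scores.mean(), logit_scores.mean(), xgbScores.mean())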
The results have improved a bit (hopefully the models are not overfitting the data)…
The average accuracy scores are now:
- SVC: 76.4%
- Random Forest: 80.1%
- Logistic Regression: 81.9%
- XGBoost: 81.3%
Tuning the Models…
Using grid search to find a better fitting model…
from sklearn.model_selection import GridSearchCV
# support vector machine...
parameter_grid = {
'C' : [ 1, 10, 100, 1000, 10000],
'kernel' : ['linear', 'rbf', 'sigmoid'],
'gamma': ['scale', 'auto']
}
svm_clf = SVC ()
grid_search = GridSearchCV (svm_clf,
scoring='accuracy',
param_grid=parameter_grid,
cv=3,
n_jobs=2,
verbose=1)
grid_search.fit (xTrain, yTrain)
model = grid_search.best_estimator_
parameters = grid_search.best_params_
print(f'Best score: {grid_search.best_score_}')
print(f'Best estimator: {grid_search.best_estimator_}')
# Find the best parameters for the random forest
parameter_grid = {
'max_depth' : [8, 10, 12],
'n_estimators': [10, 50, 100],
'max_features': ['sqrt'],
'min_samples_split': [2, 3, 10],
'min_samples_leaf': [1, 3, 10],
'bootstrap': [True, False] }
rf_clf = RandomForestClassifier(n_jobs=2)
grid_search = GridSearchCV (rf_clf,
scoring='accuracy',
param_grid=parameter_grid,
cv=3,
n_jobs=2,
verbose=1)
grid_search.fit (xTrain, yTrain)
model = grid_search.best_estimator_
parameters = grid_search.best_params_
print(f'Best score: {grid_search.best_score_}')
print(f'Best estimator: {grid_search.best_estimator_}')
# Find the best parameters for logit classifier
parameter_grid = {
'C' : [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100] }
logit_clf = LogisticRegression(random_state=0, solver='lbfgs', max_iter=10000, n_jobs=2)
grid_search = GridSearchCV (logit_clf,
scoring='accuracy',
param_grid=parameter_grid,
cv=3,
n_jobs=2,
verbose=1)
grid_search.fit (xTrain, yTrain)
model = grid_search.best_estimator_
parameters = grid_search.best_params_
print(f'Best score: {grid_search.best_score_}')
print(f'Best estimator: {grid_search.best_estimator_}')
# xgboost...
parameter_grid = {
'max_depth' : [7, 8, 9],
'max_delta_step': [1],
'n_estimators': [20, 40, 60, 80],
'colsample_bylevel': [0.8, 0.9, 1.0],
'colsample_bytree': [0.6, 0.8, 1.0],
'subsample': [0.3, 0.4, 0.5, 0.6],
}
xgb_clf = xgb.XGBClassifier()
grid_search = GridSearchCV (xgb_clf,
scoring='accuracy',
param_grid=parameter_grid,
cv=3,
n_jobs=2,
verbose=1)
grid_search.fit (xTrain, yTrain)
model = grid_search.best_estimator_
parameters = grid_search.best_params_
print(f'Best score: {grid_search.best_score_}')
print(f'Best estimator: {grid_search.best_estimator_}')
From the grid search we were able to produce the following results…
- SVC: 82.6%
- Random Forest: 83.5%
- Logistic Regression: 82.0%
- XGBoost: 81.9%
Testing the Results on Kaggle…
The final tuned models were run on the test set and the predictions submitted to Kaggle for verification. The code used to generate the Kaggle submissions was:
cleanedTestingData = TransformData_V2 (rawTestData)
xTest = cleanedTestingData.drop (['PassengerId'], axis=1)
yPred_svm = svm_clf.predict (xTest)
yPred_logit = logit_clf.predict (xTest)
yPred_rf = rf_clf.predict (xTest)
yPred_xgb = xgb_clf.predict (xTest)
output = pd.DataFrame({'PassengerId': cleanedTestingData.PassengerId, 'Survived': yPred_svm})
output.to_csv('sub_svm_tuned.csv', index=False)
output = pd.DataFrame({'PassengerId': cleanedTestingData.PassengerId, 'Survived': yPred_logit})
output.to_csv('sub_logit_tuned.csv', index=False)
output = pd.DataFrame({'PassengerId': cleanedTestingData.PassengerId, 'Survived': yPred_rf})
output.to_csv('sub_rf_tuned.csv', index=False)
output = pd.DataFrame({'PassengerId': cleanedTestingData.PassengerId, 'Survived': yPred_xgb})
output.to_csv('sub_xgb_tuned.csv', index=False)
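One caveat about the cell above: svm_clf, rf_clf, logit_clf, and xgb_clf are the fresh classifiers created for the grid searches (GridSearchCV fits clones, not the original objects), and each search stored its winner in the same model variable. A sketch of using the tuned estimators instead, where the *_best names are hypothetical and would each be assigned right after the corresponding grid search (best_estimator_ is already refit on the full training data):
# hypothetical: saved immediately after each grid search, e.g. svm_best = grid_search.best_estimator_
yPred_svm = svm_best.predict (xTest)
yPred_rf = rf_best.predict (xTest)
yPred_logit = logit_best.predict (xTest)
yPred_xgb = xgb_best.predict (xTest)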
A few noticeable things from the results:
- All the steps do improve the results on the training data; however, this might be a sign of overfitting
- Logistic regression was the overall winner (roughly 78%) given how the data was transformed.
- It seems that the random forest and xgboost classifiers can easily overfit the data.
| | Support Vector Classifier | Random Forest | Logistic Regression | XGBoost |
|---|---|---|---|---|
| Plain Model | 73.2% | 79.8% | 79.9% | 81.1% |
| Model with new Features | 76.4% | 80.1% | 81.9% | 81.3% |
| Tuned Model | 82.6% | 83.5% | 82.0% | 81.9% |
| Testing Data | 77.03% | 76.56% | 77.99% | 71.5% |
This problem was more of a learning exercise, so instead of spending more time trying to improve the score on the test set, it is probably better to move on to the next problem.
Code…
All the code is available here as a Jupyter notebook.