Machine Learning from Disaster: Data exploration and feature engineering

In this post we will look at the basics of data loading, cleaning and visualisation with the Kaggle ‘Titanic: Machine Learning from Disaster’ data set. The data provides details for a number of passengers, including age, class and ticket price. The aim of the next post will be to apply machine learning tools to predict which passengers survived the tragedy.

Get the data

Pandas dataframes allow us to load in data in a variety of formats (CSV, JSON, SQL, Excel). In this example our training data is a CSV.
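As a minimal sketch of the loading step: with the real data this would be a single read_csv call on the downloaded file (the file name train.csv is an assumption). So that the snippet runs on its own, it reads three illustrative rows in the Kaggle column layout from an in-memory string instead.

```python
import io

import pandas as pd

# Stand-in for the Kaggle training file. With the real data this would be
# something like pd.read_csv("train.csv") (file name assumed).
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
"""

# Empty fields (such as the missing cabins) are read as NaN by default.
train = pd.read_csv(io.StringIO(csv_text))
print(train.shape)
```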

What does the data look like?

The first step in examining a column/row based data set is to see what columns are in the table, and what types they are. We can then preview the first few rows of the data.
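One way to do this in pandas is with head() and the dtypes attribute. The small frame below is illustrative only and stands in for the training data.

```python
import pandas as pd

# Tiny illustrative frame standing in for the training data.
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, None],
    "Fare": [7.25, 71.28, 7.93, 51.86],
})

print(train.head(3))  # first three rows
print(train.dtypes)   # type of each column
```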

This gives us an idea of the data available. Now we want to check the completeness of the training data set. A good first check is to see which entries contain blank, null or empty values.
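A possible version of that check, on made-up values: empty strings are converted to NaN first so that they are counted together with the nulls.

```python
import numpy as np
import pandas as pd

# Illustrative values only, not the real data.
train = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "", "S"],
})

# Treat empty strings as missing too, then count missing values per column.
missing = train.replace("", np.nan).isnull().sum()
print(missing)
```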

Replacing the unknown ages of passengers

177 passengers have no age listed. While we could replace those values with the mean age (29 years), it would be less accurate than using another feature in the data to estimate the ages. As all of the passengers are named, we can group them by the title in their name and assign each missing value the mean age for that title.

To extract the titles from the passengers’ names, we use a regular expression that matches a run of letters (A-Z or a-z) immediately followed by a full stop.
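A sketch of that extraction with the pandas string methods. The exact pattern is an assumption based on the description above, and the names are a small sample.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# One or more letters immediately followed by a literal full stop.
# extract() returns the first match in each name.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```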

For simplicity, we will replace the titles with either Master, Miss, Mr, Mrs or other. We can go back and check these values for titles that apply to both sexes (Dr for example).
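One simple way to do the grouping is sketched below. It keeps the four common titles and labels everything else Other. A fuller version might first map variants such as Mlle or Ms onto Miss, which is not done here.

```python
import pandas as pd

# Illustrative list of extracted titles.
titles = pd.Series(["Mr", "Mrs", "Miss", "Master", "Dr", "Rev", "Mlle"])

# Keep the four common titles and label everything else "Other".
common = ["Master", "Miss", "Mr", "Mrs"]
grouped = titles.where(titles.isin(common), "Other")
print(grouped.value_counts())
```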

Now that we have the average age for each title, we can fill the missing values.
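A minimal sketch of the fill, assuming the titles are stored in a column called Title (a hypothetical name) and using made-up ages.

```python
import numpy as np
import pandas as pd

# Illustrative values; "Title" is an assumed column name.
train = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Master", "Master"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 4.0, 6.0],
})

# Mean age per title, broadcast back to every row with that title.
title_mean_age = train.groupby("Title")["Age"].transform("mean")

# Only the missing ages are replaced; known ages are left untouched.
train["Age"] = train["Age"].fillna(title_mean_age)
print(train)
```

Using transform keeps the result aligned with the original rows, so it can be passed straight to fillna.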

Let’s see what the distribution of the numerical values looks like:
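In pandas, describe() gives this kind of summary for the numeric columns. The values below are illustrative and will not reproduce the figures quoted for the full training set.

```python
import pandas as pd

# Illustrative values only.
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.93, 8.05, 51.86],
})

# Count, mean, standard deviation, min, quartiles and max per numeric column.
summary = train.describe()
print(summary)
```

The mean of the Survived column is the survival rate, because the column only holds 0 and 1.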

Numerical values summary

  • The training set has details for 891 of the total 2,224 passengers on board the Titanic.
  • Within the training set, 38% survived (Survived = 1) and 62% perished.
  • The average fare was £32.20; the most expensive fare was £512.32 and the least expensive was £0.

What does the distribution of the categorical values look like?
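The same describe() method can summarise the non-numeric columns if the numeric ones are excluded. A sketch on illustrative values:

```python
import numpy as np
import pandas as pd

# Illustrative values only.
train = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "S"],
    "Fare": [7.25, 71.28, 8.05, 13.0],
})

# For non-numeric columns describe() reports the count, the number of
# unique values, the most common value (top) and its frequency (freq).
cat_summary = train.describe(exclude=[np.number])
print(cat_summary)
```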

Categorical values summary

  • 65% of the passengers in our training set are male.
  • Tickets and cabin values have a high number of duplicate values.
  • Most passengers (72%) embarked at port ‘S’.

Investigating correlations between features and survival

Correlations are useful for determining what features should be included in our model to predict which passengers survive. The Seaborn data visualisation package lets us create a correlation heatmap.
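A sketch of how such a heatmap could be produced. The correlation matrix comes from pandas, and the plotting is wrapped in a helper (plot_heatmap is a hypothetical name) that needs Seaborn and Matplotlib installed. The data is illustrative.

```python
import pandas as pd

# Illustrative values only.
train = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 13.00],
})

# Pairwise Pearson correlations between the numeric columns.
corr = train.corr()
print(corr.round(2))


def plot_heatmap(corr_matrix):
    """Draw the matrix as an annotated heatmap (needs seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.show()
```

Calling plot_heatmap(corr) draws the figure. Note that a lower Pclass number means a higher class, so a negative correlation between Pclass and Survived points to better survival in the higher classes.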

Fare and class appear to be strongly correlated with survival rates. Age may show a stronger correlation when grouped into bands, making it a categorical variable rather than a continuous one. As noted previously, age has a number of missing values which we will try to fill. We can look at the rates of survival against age in more detail.

Class and age

One hypothesis for our predictive models is that the idea of ‘women and children first’ will have an impact on the survival rates of the women and children passengers. It is also likely that the more wealthy passengers in first class were given higher priority on the lifeboats.

We can see a much higher rate of survival for the younger ages (children). Next we’ll have a look at survival by class and age.

Passengers across a wide variety of ages had a high rate of survival in first class, while those in third class perished at a much greater rate. Children in second and third class had relatively better rates of survival, although many in third still did not survive.

An alternative plot that the Seaborn package offers is the violin plot:

Age appears to be an important feature for survival prediction. To include it in our future models, we will convert the ages to categorical values with binning. The oldest passenger was 80, so we will divide the ages into 5 bins of width 16.
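A minimal sketch of the binning with pd.cut, assuming explicit bin edges at multiples of 16. Passing bins=5 instead would let pandas pick slightly different edges from the data range.

```python
import pandas as pd

# Illustrative ages only.
ages = pd.Series([2.0, 16.0, 22.0, 38.0, 54.0, 80.0])

# Five right-closed bands of width 16: (0, 16], (16, 32], ..., (64, 80].
edges = [0, 16, 32, 48, 64, 80]

# labels=False returns the band number (0 to 4) instead of the interval.
age_band = pd.cut(ages, bins=edges, labels=False)
print(age_band.tolist())
```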

How much did the passengers pay for their tickets?

We know that there was not a ‘standard’ price for each class ticket. We can see the distribution of the fare amounts paid below.

The distribution of fares across the classes is fairly spread out, so we will convert the continuous fare variable into a discrete value with binning.

Survival rates improve as the fare amounts increase. We will convert these bins into discrete values.
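One way to build fare bands is pd.qcut, which splits on quantiles so each band holds roughly the same number of passengers. The choice of four bands and the column name FareBand are assumptions, and the values are illustrative.

```python
import pandas as pd

# Illustrative values only.
train = pd.DataFrame({
    "Fare": [0.0, 7.25, 7.93, 8.05, 13.0, 26.0, 53.1, 71.28],
    "Survived": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Four quantile-based bands, numbered 0 (cheapest) to 3 (most expensive).
train["FareBand"] = pd.qcut(train["Fare"], q=4, labels=False)

# Share of survivors in each band.
survival_by_band = train.groupby("FareBand")["Survived"].mean()
print(survival_by_band)
```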

Men, women and port of embarkation

Next we consider the survival rates of males and females, and compare for each port of embarkation.
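Alongside a plot, the same comparison can be made as a table. A sketch with groupby on illustrative values; adding Embarked to the grouping keys would break the rates down by port as well.

```python
import pandas as pd

# Illustrative values only.
train = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "female"],
    "Pclass": [1, 1, 3, 1, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Survival rate for every (sex, class) combination, with classes as columns.
rates = train.groupby(["Sex", "Pclass"])["Survived"].mean().unstack()
print(rates)
```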

Within every passenger class, females survived at a much higher rate than males. We can add the embarkation port to examine the survival rates in more detail.

The survival rates of passengers decrease as the class decreases, except for male passengers who embarked at Port Q. Passengers who boarded at Port Q represent a relatively small proportion of the population. The best survival rates were for passengers who embarked at Port C, while the worst were for those who boarded at Port S. We will fill the two empty values for port of embarkation with the most common value, S.
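A minimal sketch of that fill on illustrative values, using the mode so that the most common port is not hard-coded.

```python
import pandas as pd

# Illustrative values only; None marks a missing port.
embarked = pd.Series(["S", "C", None, "S", "Q", None, "S"])

# mode() ignores missing values; [0] takes the most frequent port.
most_common = embarked.mode()[0]
embarked = embarked.fillna(most_common)
print(most_common, embarked.tolist())
```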

Siblings, spouses, parents and children

The variable SibSp gives the number of siblings or spouses travelling with the passenger, so a value of 0 means none were aboard. Parch gives the number of parents or children on board. We will look at the relationship between these two features and the survival rates of the passengers.

398 passengers travelling without siblings or a spouse perished, as did all passengers with 5 or more siblings/spouses aboard. This looks like an important feature for a predictive model.
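Counts like these can be read from a cross-tabulation of SibSp against Survived. The sketch below uses illustrative values, so the numbers will not match the full training set.

```python
import pandas as pd

# Illustrative values only.
train = pd.DataFrame({
    "SibSp": [0, 0, 0, 1, 1, 5, 8],
    "Survived": [0, 0, 1, 1, 0, 0, 0],
})

# Rows: siblings/spouses aboard. Columns: perished (0) and survived (1).
counts = pd.crosstab(train["SibSp"], train["Survived"])
print(counts)
```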

Why is the survival rate for passengers with large families so poor?


All of the larger families were in third class, which had a much lower survival rate.

Similarly for the Parch feature, having no parents/children aboard, or more than three, was associated with a lower survival rate.

Final data cleaning for predictive modelling

The last step is to convert our categorical values that are currently recorded as strings to numeric values, so that a machine learning model can process the features. We will also drop the features that are not required, or that are already represented by a banded value.
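A sketch of this step on two illustrative rows. The integer codes and the AgeBand column name are assumptions chosen for the example.

```python
import pandas as pd

# Illustrative values only; "AgeBand" is an assumed column name.
train = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Sex": ["male", "female"],
    "Embarked": ["S", "C"],
    "Age": [22.0, 26.0],
    "AgeBand": [1, 1],
})

# Map string categories to integer codes (the code values are arbitrary).
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# Drop the free-text name, and Age because AgeBand already represents it.
train = train.drop(columns=["Name", "Age"])
print(train)
```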

Finally, we can generate a new correlation matrix for the features that may be included in a predictive model.

In the next post, we will investigate a number of predictive machine learning models to see how effectively each one predicts the survival of passengers.