Analyzing the sinking of the Titanic – Data Analysis with Python (Course V)

Exploratory Data Analysis – Part 2

Now that we have understood a bit more about the data, it is time to perform some in-depth analysis. In this chapter we will be finding the survival distribution of passengers relative to various features.

First, let us look at the distribution of survivors (1) vs non-survivors (0). The value_counts() method can provide us with the frequency of occurrence of unique values of our target column.

```
# Finding the frequency count of survivors (1) and non-survivors (0)

df['Survived'].value_counts()
```

```
0    549
1    342
Name: Survived, dtype: int64
```

The survival rate is about 38%: of the 891 passengers in the dataset, 342 survived whereas 549 lost their lives in the sinking of the Titanic.
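The percentage can also be computed rather than worked out by hand. The following is a minimal sketch that rebuilds only the ‘Survived’ column from the two counts printed above, so it runs without the dataset; with the real dataframe you would call the same methods on `df['Survived']`. The variable names are our own.

```python
import pandas as pd

# Rebuild only the 'Survived' column from the counts printed above
# (549 zeros and 342 ones); the real dataframe has many more columns.
survived = pd.Series([0] * 549 + [1] * 342, name='Survived')

# normalize=True returns proportions instead of raw counts
survival_share = survived.value_counts(normalize=True)
print(survival_share.round(3))

# 'Survived' is coded as 0/1, so its mean is the survival rate
survival_rate = survived.mean()
print(round(survival_rate, 3))
```

Because the column only contains 0 and 1, taking the mean is a convenient shortcut for the share of survivors.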

Analyzing the survival distribution of passengers according to their features

a. Gender

```
# Plotting the number of survivors and non-survivors according to gender
# (x= is passed as a keyword, as required by recent seaborn versions)

fig = plt.figure()
sns.countplot(x='Sex', hue='Survived', data=df)
fig.suptitle('Survival distribution of male and female')
plt.show()
```

With this visualization, we can see that far more male passengers than female passengers lost their lives. This is an interesting finding, and it may be because women were among the first to leave the ship after it struck the iceberg.

b. Pclass

```
# Plotting the number of survivors and non-survivors according to Pclass

fig = plt.figure()
sns.countplot(x='Pclass', hue='Survived', data=df)
fig.suptitle('Survival distribution of Pclass')
plt.show()
```

The ‘Pclass’ column represents the class of ticket purchased by a passenger. It can be observed that a large number of passengers of ticket class ‘3’ failed to survive the sinking.

c. SibSp

```
# Plotting the number of survivors and non-survivors according to SibSp

fig = plt.figure()
sns.countplot(x='SibSp', hue='Survived', data=df)
fig.suptitle('Survival distribution of SibSp')
plt.show()
```

The ‘SibSp’ column represents the number of siblings/spouses aboard the Titanic. Most passengers travelled without siblings or spouses, so the largest number of deaths falls in that group. Keep in mind that the plot shows counts, not rates.

d. Embarked

```
# Plotting the number of survivors and non-survivors according to Embarked

fig = plt.figure()
sns.countplot(x='Embarked', hue='Survived', data=df)
fig.suptitle('Survival distribution of Embarked')
plt.show()
```

The ‘Embarked’ column represents the port of embarkation. We can observe that most of the passengers boarded the ship at Southampton (‘S’), and thus the number of deaths is highest among that port’s passengers.

e. Parch

```
# Plotting the number of survivors and non-survivors according to Parch

fig = plt.figure()
sns.countplot(x='Parch', hue='Survived', data=df)
fig.suptitle('Survival distribution of Parch')
plt.show()
```

The ‘Parch’ column represents the number of parents/children aboard the Titanic. The survival distribution is very similar to the survival distribution of ‘SibSp’ column.
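The count plots above show absolute numbers, so the largest groups dominate the picture. To compare groups of different sizes, it can help to look at proportions as well. The following is a minimal sketch using `pd.crosstab`; the small dataframe `toy` is made up for illustration and does not contain real Titanic values.

```python
import pandas as pd

# Hypothetical toy data, NOT the real Titanic values
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'S', 'C', 'C', 'Q', 'Q'],
    'Survived': [0, 0, 0, 1, 1, 0, 1, 0],
})

# Raw counts: the same numbers a count plot would draw
counts = pd.crosstab(toy['Embarked'], toy['Survived'])

# normalize='index' divides each row by its total, giving per-group rates
rates = pd.crosstab(toy['Embarked'], toy['Survived'], normalize='index')

print(counts)
print(rates.round(2))
```

With the real dataframe, the same two calls on `df['Embarked']` and `df['Survived']` (or any of the other feature columns) would show whether a tall bar reflects a high mortality rate or simply a large group.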

Analyzing the relationship between ‘Pclass’ and ‘Fare’

As mentioned above, the ‘Pclass’ column represents the class of ticket purchased by a passenger. It would be useful to know the mean fare for each ticket class.

```
# Grouping the data by 'Pclass' and finding the mean of 'Fare' in each group

df.groupby(['Pclass'])[['Fare']].mean()
```

So, ticket class ‘1’ is the most expensive one whereas ‘3’ is the least expensive.
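A mean can be pulled upward by a handful of very expensive tickets, so it may be worth checking the median and the group size next to it. The following is a minimal sketch with made-up fares; the dataframe `toy` is hypothetical and only there to make the example runnable on its own.

```python
import pandas as pd

# Hypothetical fares, NOT the real Titanic values
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 3, 3, 3],
    'Fare':   [80.0, 60.0, 500.0, 13.0, 26.0, 7.0, 8.0, 15.0],
})

# agg() computes several summary statistics per group in one call
fare_summary = toy.groupby('Pclass')['Fare'].agg(['mean', 'median', 'count'])
print(fare_summary)
```

In this toy data, the single 500.0 fare lifts the class ‘1’ mean well above its median. Running the same `agg` call on the real `df` would show whether something similar happens there.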

Looking at the graph we plotted above for the survival distribution of Pclass, it can be observed that a large number of passengers of ticket class ‘3’ failed to survive the sinking.

This brings up three interesting questions:

Q. Is the ratio of survivors to non-survivors similar for passengers in different ticket classes?

The answer is no. Just look at the above bar graph and you’ll see the difference much more clearly.

Q. Were the passengers from low-priced ticket classes ignored and the passengers from high-priced ticket classes rescued?

The answer is maybe. There were far more passengers in the low-priced ticket classes, yet the number of survivors is in a similar range for all three ticket classes. The counts alone do not tell us why.

Q. Is the survival rate of male and female passengers biased by their ticket class (Pclass)?

The answer right now is we don’t know. So, let’s work on finding the answer.

This part of the lesson might get tricky but bear with us since we are now trying to find insights from three different columns simultaneously. The dataframe printed below shows us ‘Survival distribution per Sex per Pclass’.

```
# Grouping the data by 'Pclass', 'Sex' and 'Survived' and counting the rows in each group
# size() returns a Series; to_frame('Count') turns it into a dataframe with a 'Count' column

df.groupby(['Pclass', 'Sex', 'Survived']).size().to_frame('Count')
```

First, let us take a look at the survival distribution of female passengers per Pclass. If you look closely, in ‘Pclass 1’ all but 3 of the female passengers survived. Similarly, in ‘Pclass 2’ all but 6 survived. However, in ‘Pclass 3’ female survivors and non-survivors are evenly split, i.e. 72/72. With this information, we can say that the survival rate of female passengers is biased by the ticket class they are in.

Next, let us look at the survival distribution of male passengers per Pclass. In this case, the pattern isn’t as easy to see, but it becomes clearer if you take the total number of male passengers into account. In ‘Pclass 1’, 77 male passengers lost their lives out of the total 122 (~63% mortality rate). Similarly, in ‘Pclass 2’, 91 male passengers lost their lives out of the total 108 (~84% mortality rate). Finally, in ‘Pclass 3’, 300 male passengers lost their lives out of the total 347 (~86% mortality rate). With this information, we can only say that the survival rate of male passengers depends far less on ticket class than that of female passengers: mortality was high in every class, although men in ‘Pclass 1’ fared somewhat better.
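The percentages above can be computed for all six groups at once. The following is a minimal sketch that works from a table of counts rather than the full dataset. The non-survivor counts, the male totals and the 72/72 split come from the text above, and the male survivor counts are simply total minus deaths. The female survivor counts for ‘Pclass 1’ and ‘Pclass 2’ (91 and 70) are not stated in the text and are assumed from the commonly used Kaggle training set.

```python
import pandas as pd

# (Pclass, Sex, Survived, Count)
# Female survivor counts for Pclass 1 and 2 (91, 70) are an assumption
# taken from the standard Kaggle training set; the rest follows from the text.
rows = [
    (1, 'female', 0, 3),   (1, 'female', 1, 91),
    (1, 'male',   0, 77),  (1, 'male',   1, 45),
    (2, 'female', 0, 6),   (2, 'female', 1, 70),
    (2, 'male',   0, 91),  (2, 'male',   1, 17),
    (3, 'female', 0, 72),  (3, 'female', 1, 72),
    (3, 'male',   0, 300), (3, 'male',   1, 47),
]
counts = pd.DataFrame(rows, columns=['Pclass', 'Sex', 'Survived', 'Count'])

# One row per (Pclass, Sex), one column per outcome (0 = died, 1 = survived)
table = counts.set_index(['Pclass', 'Sex', 'Survived'])['Count'].unstack('Survived')

# Mortality rate = non-survivors divided by the group total
mortality = table[0] / table.sum(axis=1)
print((mortality * 100).round(1))
```

With the real dataframe, `df.groupby(['Pclass', 'Sex'])['Survived'].mean()` would give the survival rate per group directly, since ‘Survived’ is coded as 0/1.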

We successfully answered all three of our questions based on the available data. We know it is hard to write ‘maybe’ as an answer but we just don’t have the necessary amount of insight to give a bold ‘Yes’ or ‘No’ answer.

By the way, did you realize that we just tied multiple analyses together to frame and answer questions that were completely out of the picture at the beginning of the analysis? This is what EDA is all about, and the approach you take to analyze a dataset is very important.