Welcome to this project tutorial on Exploratory Data Analysis (EDA) with Python. In this tutorial, you will perform hands-on EDA on the dataset of the well-known Kaggle competition, ‘Titanic: Machine Learning from Disaster’.
You will learn how to perform general as well as problem-specific analyses to draw insights from the given dataset. The tutorial is meant to be informative and uses Python libraries such as Pandas, Matplotlib and Seaborn.
Objectives
The learning objectives of this project tutorial are set out as follows:
- Learn how to download a dataset from Kaggle
- Learn how to read in a CSV file format dataset in Python
- Learn how to perform Exploratory Data Analysis (EDA)
- Learn how to visualize data insights
You can expect to have met all of these objectives by the time you reach the end of this course.
Pre-requisites
If this is your first time working with Python, it may be hard to grasp all the concepts. The following pre-requisites will help you get the best out of the project tutorial:
- Solid understanding of the Python programming language
- Familiarity with Pandas, Matplotlib and Seaborn
- Interest in performing data analysis with Python
If you do not satisfy the above pre-requisites, don’t worry! You can always come back later once you are ready.
Best way to work through the tutorial
The project tutorial is not long but requires a good amount of attention from your end.
Before moving to the next lecture, we suggest you set up your coding environment and open your Jupyter Notebook. If you are a more advanced Python user and have your own preferences, feel free to use the IDE you prefer. However, all of the code examples are written for execution in Jupyter Notebook cells.
Also, if you run into a problem, first check whether your code matches the code in the course exactly. If you still face errors or have doubts, please post your question in the comment section of the chapter you are stuck on.
We also recommend you join our community and get connected to our vibrant network of data science aspirants. Once you are in the community, you can share your learnings, form a study group or even get help building a project around Data Analysis.
All good? Let’s get started.
How to get the dataset?
Since most of our current and future tutorials use data hosted on Kaggle, it is a good idea to learn how to get a dataset from Kaggle as the first step in this course.
To get the ‘Titanic: Machine Learning from Disaster’ dataset, follow these steps:
1. Visit the Kaggle platform through the following URL: https://www.kaggle.com/
2. If you already have a Kaggle account, simply sign in to the platform. If you don’t, click on the ‘Register’ button at the top right of the platform and create a new account. Once you have created the account, log in to the platform with your credentials.
3. Once you’ve signed in, go to the following URL: https://www.kaggle.com/c/titanic/data
4. Scroll to the end of the page and click on the ‘I understand and agree’ button after reading the competition rules (terms and conditions of the competition). Alternatively, you can also click on ‘Join Competition’.
5. After you’ve entered the competition, select the train.csv file and download it by clicking on the ‘download’ icon at the top right side of the UI. Place the downloaded file in your working directory.
There you have it! You are now ready to get started with the course.
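Before moving on, it can help to confirm that the downloaded file really is in your working directory. The snippet below is a minimal sketch using only the standard library. The helper name `dataset_is_ready` is hypothetical and not part of Kaggle or Pandas.

```python
from pathlib import Path


def dataset_is_ready(filename="train.csv", directory="."):
    """Return True if the file exists in the directory and is not empty."""
    path = Path(directory) / filename
    return path.is_file() and path.stat().st_size > 0


# Checks the current working directory by default.
print(dataset_is_ready())
```

If this prints `False`, move train.csv next to your notebook or pass the folder that contains it as `directory`.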
Overview of the dataset
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. The train.csv file contains data on 891 of the passengers who were on that ship, including whether they survived the sinking or not.
Data Dictionary
Variable | Definition | Key |
survival | Survival | 0 = No, 1 = Yes |
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
sex | Sex | Male / Female |
Age | Age in years | – |
sibsp | # of siblings / spouses aboard the Titanic | – |
parch | # of parents / children aboard the Titanic | – |
ticket | Ticket number | – |
fare | Passenger fare | – |
cabin | Cabin number | – |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
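The coded values in the ‘Key’ column can be turned into readable labels with a simple lookup. The following is a small sketch on made-up rows, assuming the capitalised column names used in train.csv (‘Survived’, ‘Pclass’, ‘Embarked’). The dictionaries are copied from the data dictionary above, and the variable names are illustrative.

```python
import pandas as pd

# Toy rows for illustration only; these are not taken from train.csv.
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 2],
    "Embarked": ["S", "C", "Q"],
})

# Lookup tables copied from the 'Key' column of the data dictionary.
survived_labels = {0: "No", 1: "Yes"}
pclass_labels = {1: "1st", 2: "2nd", 3: "3rd"}
port_labels = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# Series.map replaces each code with its label; the original frame is untouched.
decoded = toy.assign(
    Survived=toy["Survived"].map(survived_labels),
    Pclass=toy["Pclass"].map(pclass_labels),
    Embarked=toy["Embarked"].map(port_labels),
)
print(decoded)
```

Decoding into a copy keeps the numeric codes available for later calculations while giving you readable labels for tables and plots.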
Introduction to Exploratory Data Analysis
Whenever a data scientist receives a new dataset, do they start building a Machine Learning model right away?
Well, the answer is no. They start exploring the dataset in a quest to better understand the different features present in the dataset and their relationship to each other. In other words, they perform Exploratory Data Analysis or simply, EDA.
As a formal definition, Exploratory Data Analysis (EDA) is an approach for data analysis that makes use of various analytical and graphical techniques to:
- better understand the data
- extract important variables for data modeling
- detect outliers and anomalies
- generate and test one or more hypotheses about the data
A good data scientist has excellent EDA skills, and in this course we will focus on building those skills.
First of all, import the libraries needed to work with the dataset in Python.
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
As you may recall from your knowledge of Python,
- NumPy is used for numerical computations
- Pandas is used for data processing as well as for CSV file I/O
- Matplotlib and Seaborn are used for data visualization
Next, read in the data using the read_csv() method of Pandas and look at the first five rows using the head() method.
# Reading in the data
df = pd.read_csv('train.csv')
df.head()
This gives us a first impression of the kind of data we are dealing with. Let us look at the shape of the dataframe to see how many rows and columns there are.
# Finding the shape of the dataframe
df.shape
Output: (891, 12)
There are a total of 891 rows and 12 columns.
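If you want to try the same calls without the Kaggle file, the sketch below reads a tiny CSV from an in-memory string. The rows are made up for illustration and only mimic the header style of train.csv.

```python
import io

import pandas as pd

# A tiny made-up CSV (illustrative values only, not rows from train.csv).
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,
"""

# read_csv accepts any file-like object, so StringIO stands in for a file on disk.
toy = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = toy.shape
print(n_rows, n_cols)
print(toy.dtypes)
```

Note that the empty ‘Age’ field in the last row is read as a missing value, which is why Pandas stores that column as floating point.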
Now, get a basic statistical description of the numeric columns in the dataset using the describe() method.
df.describe()
Interesting! The ‘count’ of ‘Age’ is not 891 which means that there are missing values in the dataset. We should certainly check the entire dataframe for missing values as a first step.
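The reason ‘count’ reveals missing data is that describe() counts only non-null entries. Here is a minimal sketch on a made-up frame; the values are illustrative and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Made-up values; two of the four ages are missing on purpose.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, np.nan],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

summary = toy.describe()

# 'count' is the number of non-null values per column.
print(summary.loc["count"])

# Rows minus count gives the number of missing values per numeric column.
missing = len(toy) - summary.loc["count"]
print(missing)
```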
The isnull() method is useful in finding which data values are null in the dataframe.
# Checking for null values in the dataframe
df.isnull()
The cells where the value is ‘True’ are the ones where the dataset contains null data. Now, sum up all the ‘True’ values to find the number of missing values per column. We will use the sum() method for this.
# Checking for total null values per column
df.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
As we can see, there are 177 missing values in the ‘Age’ column, 687 missing values in the ‘Cabin’ column and 2 missing values in the ‘Embarked’ column.
In most real-life datasets, there can be a lot of missing values and there are different ways to fill in these missing values. If you are interested in learning about that in a separate course, please let us know in the comment section!
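Raw counts are easier to judge as a share of all rows. The sketch below reuses the three missing-value counts and the row count of 891 reported above. The variable names are illustrative.

```python
import pandas as pd

n_rows = 891  # number of rows reported by df.shape

# Missing-value counts copied from the df.isnull().sum() output.
missing_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})

# Share of rows with a missing value, in percent.
missing_pct = (missing_counts / n_rows * 100).round(1)
print(missing_pct)
```

On the real dataframe, `df.isnull().mean() * 100` should give the same percentages in one step, since the mean of a boolean column is the fraction of ‘True’ values.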
Types of Features
By now, you should have a feel for the data. Therefore, it is the right time to talk about the different types of features you are looking at.
Numerical/Continuous features: A feature is said to be numerical or continuous if it can take any value between two points, that is, anywhere between the minimum and maximum values in the feature’s column. For example, ‘Age’ is a continuous feature in the dataset.
Categorical features: A categorical feature is one that has two or more categories and each value in that feature can be categorised by them. For example, gender is a categorical variable having two categories (male and female). ‘Sex’ and ‘Embarked’ are categorical features in the dataset.
Ordinal features: An ordinal feature is similar to a categorical feature, except that its values have a relative order. For example, if we have a feature like Height with values Tall, Medium, Short, then Height is an ordinal variable. ‘Pclass’ is an ordinal feature in the dataset.
DateTime features: A feature is said to be a DateTime feature if the feature holds DateTime values. For example, a feature with the value ‘2020/02/01 01:01:00’ is a DateTime feature. There are no DateTime features in the given dataset.
Co-ordinate features: A feature is said to be a co-ordinate feature if the feature holds co-ordinate values. For example, a feature with the value ‘(27.7172, 85.3240)’ is a co-ordinate feature. There are no co-ordinate features in the given dataset.
Frequency features: A feature is said to be a frequency feature if the feature holds a count of items as its value. For example, a feature with the value ‘200’ is a frequency feature if it represents the count of 200 people who are on the Titanic. ‘SibSp’ and ‘Parch’ are frequency features.
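Pandas can record some of these distinctions in the column type itself. The following is a sketch on made-up rows that marks ‘Pclass’ as an ordered (ordinal) category and ‘Sex’ as an unordered one. Treating 1 < 2 < 3 as the order is a labeling choice made here for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Pclass": [3, 1, 2, 3],
    "Sex": ["male", "female", "female", "male"],
})

# Ordinal: the categories carry an order (here 1 < 2 < 3).
toy["Pclass"] = pd.Categorical(toy["Pclass"], categories=[1, 2, 3], ordered=True)

# Plain categorical: there is no order between the categories.
toy["Sex"] = toy["Sex"].astype("category")

print(toy.dtypes)
# min() and max() are only meaningful because Pclass is ordered.
print(toy["Pclass"].min(), toy["Pclass"].max())
```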
Now that we have understood a bit more about the data, it is time to perform some in-depth analysis. In this chapter we will be finding the survival distribution of passengers relative to various features.
First, let us look at the distribution of survivors (1) vs non-survivors (0). The value_counts() method can provide us with the frequency of occurrence of unique values of our target column.
# Finding the frequency count of survivors (1) and non-survivors (0)
df['Survived'].value_counts()
0    549
1    342
Name: Survived, dtype: int64
That is a survival rate of about 38%: 342 of the 891 passengers in the dataset survived, whereas 549 lost their lives during the sinking of the Titanic.
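The percentage can be checked directly by asking value_counts() for proportions instead of counts. The sketch below rebuilds a column with the same 549/342 split rather than reading train.csv.

```python
import pandas as pd

# Rebuild a 'Survived' column with the counts shown above (549 zeros, 342 ones).
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# normalize=True returns fractions that sum to 1 instead of raw counts.
rates = survived.value_counts(normalize=True)
print(rates.round(3))
```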
Analyzing the survival distribution of passengers according to their features
a. Gender
# Plotting the number of survivors and non-survivors according to gender
fig = plt.figure()
# The column is passed as x= because recent seaborn releases require keyword arguments here
sns.countplot(x='Sex', hue='Survived', data=df)
fig.suptitle('Survival distribution of male and female')
plt.show()
With this visualization, we can see that far more male passengers than female passengers lost their lives. This is an interesting finding, and a plausible explanation is that women were among the first to be evacuated from the ship.
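A count plot shows absolute numbers. Because ‘Survived’ is coded as 0 and 1, its mean within a group is that group’s survival rate, which makes groups of different sizes comparable. Here is a minimal sketch on made-up rows; the values are illustrative and are not taken from train.csv.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

The same one-liner should work on the real dataframe by replacing `toy` with `df`.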
b. Pclass
# Plotting the number of survivors and non-survivors according to Pclass
fig = plt.figure()
sns.countplot(x='Pclass', hue='Survived', data=df)
fig.suptitle('Survival distribution of Pclass')
plt.show()
The ‘Pclass’ column represents the class of ticket purchased by a passenger. It can be observed that a large number of passengers of ticket class ‘3’ failed to survive the sinking.
c. SibSp
# Plotting the number of survivors and non-survivors according to SibSp
fig = plt.figure()
sns.countplot(x='SibSp', hue='Survived', data=df)
fig.suptitle('Survival distribution of SibSp')
plt.show()
The ‘SibSp’ column represents the number of siblings/spouses aboard the Titanic. Most passengers travelled without siblings or a spouse, so this group also accounts for the largest number of deaths. Note that the plot shows counts, not rates, so a tall bar does not by itself mean a higher mortality rate.
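When groups differ a lot in size, it can be useful to put counts and within-group shares side by side. The sketch below uses pd.crosstab on a made-up frame with a generic ‘Group’ column. It is deliberately constructed so that the larger group has more deaths but a lower death rate, and it says nothing about the real Titanic data.

```python
import pandas as pd

# Made-up data: group "a" is large, group "b" is small.
toy = pd.DataFrame({
    "Group":    ["a", "a", "a", "a", "a", "a", "b", "b"],
    "Survived": [0,   0,   0,   1,   1,   1,   0,   0],
})

# Absolute counts per group and outcome.
counts = pd.crosstab(toy["Group"], toy["Survived"])

# normalize="index" turns each row into shares that sum to 1.
shares = pd.crosstab(toy["Group"], toy["Survived"], normalize="index")

print(counts)
print(shares)
```

Passing a real feature such as `df['SibSp']` in place of `toy["Group"]` should give the corresponding tables for the Titanic data.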
d. Embarked
# Plotting the number of survivors and non-survivors according to Embarked
fig = plt.figure()
sns.countplot(x='Embarked', hue='Survived', data=df)
fig.suptitle('Survival distribution of Embarked')
plt.show()
The ‘Embarked’ column represents the port of embarkation. We can observe that most of the passengers boarded the ship at Southampton, and consequently that port also accounts for the largest number of deaths.
e. Parch
# Plotting the number of survivors and non-survivors according to Parch
fig = plt.figure()
sns.countplot(x='Parch', hue='Survived', data=df)
fig.suptitle('Survival distribution of Parch')
plt.show()
The ‘Parch’ column represents the number of parents/children aboard the Titanic. The survival distribution is very similar to the survival distribution of the ‘SibSp’ column.
Analyzing the relationship between ‘Pclass’ and ‘Fare’
As mentioned above, the ‘Pclass’ column represents the class of ticket purchased by a passenger. It would be useful to know the mean fare in each ticket class.
# Grouping the data by 'Pclass' and finding the mean of 'Fare' in each group
df.groupby(['Pclass'])[['Fare']].mean()
So, ticket class ‘1’ is the most expensive, whereas ‘3’ is the least expensive.
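The mean can be pulled upwards by a few very expensive tickets, so it may be worth looking at the median and the group size as well. Here is a minimal sketch with made-up fares; the numbers are illustrative and are not taken from train.csv.

```python
import pandas as pd

# Made-up fares for illustration only; class 1 contains one very expensive ticket.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Fare": [30.0, 50.0, 250.0, 10.0, 13.0, 25.0, 7.0, 8.0, 9.0],
})

# agg() computes several statistics per group in one call.
fare_stats = toy.groupby("Pclass")["Fare"].agg(["mean", "median", "count"])
print(fare_stats)
```

In this toy example the mean fare of class 1 is more than twice its median, which is the kind of gap a single outlier can produce.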
Looking at the graph we plotted above for the survival distribution of Pclass, it can be observed that a large number of passengers of ticket class ‘3’ failed to survive the sinking.
This brings up three interesting questions:
Q. Is the ratio of survivors and non-survivors similar for passengers in different ticket classes?
The answer is no. Just look at the above bar graph and you’ll see the difference much more clearly.
Q. Were the passengers from low-priced ticket classes ignored and the passengers from high-priced ticket classes rescued?
The answer is maybe. There were far more passengers in the low-priced ticket class, yet the number of survivors is roughly comparable across all three ticket classes. The counts alone do not tell us why.
Q. Is the survival rate of male and female passengers biased by their ticket class (Pclass)?
The answer right now is we don’t know. So, let’s work on finding the answer.
This part of the lesson might get tricky but bear with us since we are now trying to find insights from three different columns simultaneously. The dataframe printed below shows us ‘Survival distribution per Sex per Pclass’.
# Grouping the data by 'Pclass', 'Sex' and 'Survived' and finding the count of 'Sex' in each group
# Also, renaming the outermost column name from 'Sex' to 'Count'
df.groupby(['Pclass','Sex','Survived'])[['Sex']].count().rename(columns={'Sex':'Count'})
First, let us take a look at the survival distribution of female passengers per Pclass. If you look closely, in ‘Pclass 1’ all but 3 of the female passengers survived. Similarly, in ‘Pclass 2’ all but 6 survived. However, in ‘Pclass 3’ female survivors and non-survivors are evenly split, i.e. 72/72. With this information, we can say that the survival rate of female passengers differs markedly by ticket class.
Next, let us look at the survival distribution of male passengers per Pclass. In this case, the pattern is harder to see in the raw counts, but it becomes clearer once you take the total number of male passengers into account. In ‘Pclass 1’, 77 male passengers lost their lives out of a total of 122 (~63% mortality rate). Similarly, in ‘Pclass 2’, 91 male passengers lost their lives out of a total of 108 (~84% mortality rate). Finally, in ‘Pclass 3’, 300 male passengers lost their lives out of a total of 347 (~86% mortality rate). With this information, we can say that ticket class matters less for male passengers than for female passengers: mortality is similar in ‘Pclass 2’ and ‘Pclass 3’, although male passengers in ‘Pclass 1’ fared noticeably better.
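The percentages above can be reproduced by reshaping the grouped counts so that survivors and non-survivors sit side by side. The sketch below types the counts in by hand instead of reading train.csv. The non-survivor counts and male totals come from the text, while the remaining survivor counts are assumptions that you should check against your own grouped output.

```python
import pandas as pd

# (Pclass, Sex, Survived) -> Count, entered by hand.
# Treat these as assumptions and verify them against your own output.
counts = pd.Series({
    (1, "female", 0): 3,   (1, "female", 1): 91,
    (1, "male", 0): 77,    (1, "male", 1): 45,
    (2, "female", 0): 6,   (2, "female", 1): 70,
    (2, "male", 0): 91,    (2, "male", 1): 17,
    (3, "female", 0): 72,  (3, "female", 1): 72,
    (3, "male", 0): 300,   (3, "male", 1): 47,
})
counts.index.names = ["Pclass", "Sex", "Survived"]

# Move 'Survived' from the row index to the columns: 0 = died, 1 = survived.
table = counts.unstack("Survived")
table["Total"] = table[0] + table[1]
table["MortalityRate"] = (table[0] / table["Total"]).round(2)
print(table)
```

Reading the ‘MortalityRate’ column row by row makes the comparison between classes, and between male and female passengers within a class, easier than scanning the raw counts.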
We have now worked through all three of our questions based on the available data. We know it is unsatisfying to give ‘maybe’ as an answer, but sometimes the data simply does not provide enough insight for a bold ‘Yes’ or ‘No’.
By the way, did you notice that we just tied multiple analyses together to frame and answer questions that were completely out of the picture at the beginning of the analysis? This is what EDA is all about, and the approach you take to analyzing a dataset is very important.
That’s it for this project tutorial!