Exploratory Data Analysis (EDA) is the initial analysis performed on data to gain insight into the trends, patterns, and relationships among the variables in a data set.
There are three major methods of performing EDA: univariate, bivariate, and multivariate analysis. In this article, we will dive deep into each of these three methods. Let’s get started!
The term uni means one, and thus univariate means one variable. As the name suggests, this method is used for analyzing data that contains only one variable. The objective of univariate analysis is to describe and summarize the data and to analyze the patterns present in it. This is done by looking at the mean, median, mode, dispersion, variance, range, standard deviation, etc. It does not deal with causes or relationships in the data.
For example, consider a data set of house prices in an area. Here the data is simply a set of numbers, i.e., a single variable representing the price of each house.
Univariate analysis can be conducted in several ways, most of which are descriptive in nature, such as frequency distribution tables, histograms, frequency polygons, and pie charts.
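The summary measures listed above can be sketched with pandas. This is a minimal illustration only: the `prices` values are hypothetical house prices made up for the example, not data from this article.

```python
import pandas as pd

# Hypothetical house prices (in thousands); values are illustrative only.
prices = pd.Series([250, 300, 300, 320, 410, 450, 500], name="price")

summary = {
    "mean": prices.mean(),
    "median": prices.median(),
    "mode": prices.mode().tolist(),        # a series can have several modes
    "range": prices.max() - prices.min(),
    "variance": prices.var(),              # sample variance (ddof=1)
    "std_dev": prices.std(),               # sample standard deviation (ddof=1)
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that pandas computes the sample variance and standard deviation by default; pass `ddof=0` if the population versions are wanted.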
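As a rough sketch of a frequency distribution table, the snippet below bins a single variable into classes and counts the values in each class. The prices and the bin edges are hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical house prices (in thousands); values are illustrative only.
prices = pd.Series([250, 300, 300, 320, 410, 450, 500])

# Assumed bin edges giving three equal-width classes:
# (200, 300], (300, 400], (400, 500]
bins = [200, 300, 400, 500]

# Assign each price to a class, then count how many fall in each one.
freq_table = pd.cut(prices, bins=bins).value_counts().sort_index()
print(freq_table)
```

The same binned counts are what a histogram draws as bars, so this table and a histogram of `prices` with the same bins carry the same information.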
Let’s take the iris dataset as our sample data set. In this dataset, iris flowers are categorized into different species based on characteristics such as sepal length, sepal width, petal length, and petal width.
We first need to import the required libraries and load the data as a pandas DataFrame:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Reading in the dataset
df = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
Let’s look at the first few rows by executing the command:
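A minimal sketch of this step, assuming pandas' `DataFrame.head()` is the command meant here (it returns the first five rows by default). The small frame below is a hypothetical stand-in with the same column names as iris.csv so that the snippet runs on its own; in the article's flow, `df` is the frame loaded from the CSV.

```python
import pandas as pd

# Stand-in rows with the same columns as iris.csv (illustrative values),
# used only so this snippet runs without downloading the file.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6, 3.9],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4, 1.7],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2, 0.4],
    "species": ["setosa"] * 6,
})

# head() returns the first five rows unless another count is passed.
first_rows = df.head()
print(first_rows)
```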
Here, setosa is one species of iris; we’ll use setosa, virginica, and versicolor for our analysis.
Now to do the univariate analysis we’ll plot sepal lengths of different species in a single plot by executing the following commands:
# Split the data frame by species
df_setosa = df.loc[df['species'] == 'setosa']
df_virginica = df.loc[df['species'] == 'virginica']
df_versicolor = df.loc[df['species'] == 'versicolor']

# Plot each species' sepal lengths along a single axis (y is fixed at 0)
plt.plot(df_setosa['sepal_length'], np.zeros_like(df_setosa['sepal_length']), 'o', label='setosa')
plt.plot(df_virginica['sepal_length'], np.zeros_like(df_virginica['sepal_length']), 'o', label='virginica')
plt.plot(df_versicolor['sepal_length'], np.zeros_like(df_versicolor['sepal_length']), 'o', label='versicolor')
plt.xlabel('Sepal length')
plt.legend()
plt.show()
With matplotlib's default color cycle (version 2.0 and later), the blue dots are setosa, the orange dots are virginica, and the green dots are versicolor. The plot suggests that virginica tends to have the longest sepals, followed by versicolor and then setosa, although the ranges of the three species overlap. Conclusions of this kind can be drawn through univariate analysis.
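The visual comparison can be backed up with per-species summary numbers for the same single variable. The sketch below uses a small hypothetical stand-in frame (illustrative values, same column names as iris.csv); with the real data, the same `groupby` call applies to the full `df`.

```python
import pandas as pd

# Stand-in data: three illustrative sepal lengths per species.
df = pd.DataFrame({
    "sepal_length": [4.7, 4.9, 5.1, 5.5, 6.0, 6.4, 6.3, 6.5, 7.1],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

# Summarize the one variable separately for each species.
stats = df.groupby("species")["sepal_length"].agg(["min", "median", "mean", "max"])
print(stats)
```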
The term bi means two, and thus bivariate means two variables. As the name suggests, this method is used for analyzing data that contains two variables. It is slightly more analytical than univariate analysis. This method is used to find the relationship between the two variables and to see how one varies with the other.
For example, along with the prices of houses in an area, we also have the area of each house. We can then plot the two variables and find the relationship between the area of a house and its price, and perhaps also predict the price of a new house given its area.
It usually has one variable as the dependent variable and another variable as an independent variable.
Bivariate analysis is conducted using Correlation coefficients and Regression analysis.
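Both techniques can be sketched with NumPy on the house example. The `area` and `price` arrays are hypothetical values invented for illustration, and the straight-line fit is just one simple form of regression.

```python
import numpy as np

# Hypothetical data: house area (square meters) and price (in thousands).
area = np.array([50, 70, 80, 100, 120, 150], dtype=float)
price = np.array([150, 200, 230, 290, 340, 430], dtype=float)

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(area, price)[0, 1]

# Least-squares straight line: price ~ slope * area + intercept.
slope, intercept = np.polyfit(area, price, deg=1)

# Use the fitted line to estimate the price of a new 110 square meter house.
predicted = slope * 110 + intercept

print(f"r = {r:.3f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
print(f"estimated price for 110 sq. m: {predicted:.1f}")
```

Here area plays the role of the independent variable and price the dependent one, matching the convention described above.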
Continuing with the above iris.csv dataset, let’s plot a scatter graph between the petal length and sepal width of different species:
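One possible way to draw such a plot with matplotlib is sketched below. The stand-in frame holds a few illustrative rows with the same column names as iris.csv so the snippet runs on its own; with the real data, the loop works unchanged on the full `df`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in rows (illustrative values) with the iris.csv column names.
df = pd.DataFrame({
    "petal_length": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
    "species": ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3,
})

fig, ax = plt.subplots()

# One scatter series per species so each gets its own color and legend entry.
for species, group in df.groupby("species"):
    ax.scatter(group["petal_length"], group["sepal_width"], label=species)

ax.set_xlabel("Petal length")
ax.set_ylabel("Sepal width")
ax.legend()
plt.show()
```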
From the plot, it can be seen that the setosa points form a cluster that is clearly separated from the other two species, so a line can be drawn that separates setosa from the rest. Given a new flower’s sepal width and petal length, we can therefore tell whether or not it is setosa. For virginica and versicolor, however, some of the points overlap, so no clear separating line can be drawn, and for that reason we need a method that is even more analytical.
Multivariate analysis is a more complex method of analysis and is used when three or more variables are present. Such analysis is usually difficult because it is very hard to visualize more than three variables in a single graph. In this type of analysis, the relationship of every variable with every other variable is examined.
For example, consider a complete data set of house pricing that contains many different fields, such as the price of the house, its time on the market, the loan available for that house, its location, its area, the number of bedrooms, the number of stories, etc.
Commonly used multivariate analysis techniques include Factor Analysis, Cluster Analysis, Analysis of Variance, Discriminant Analysis, Multidimensional Scaling, Principal Component Analysis, Redundancy Analysis, etc. There are more than 20 ways to perform multivariate analysis.
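As one illustration from this list, Principal Component Analysis can be sketched with plain NumPy by centering the data and taking a singular value decomposition. The matrix `X` below is a hypothetical stand-in (six flowers, four measurements, illustrative values), and this is a bare-bones sketch, not a full PCA implementation.

```python
import numpy as np

# Stand-in measurements: rows are flowers, columns are four features.
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.4, 3.2, 4.5, 1.5],
    [6.3, 3.3, 6.0, 2.5],
    [5.8, 2.7, 5.1, 1.9],
])

# Center each feature, then decompose; rows of Vt are the principal axes.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of the total variance captured by each component.
explained_ratio = S**2 / np.sum(S**2)

# Project the four-dimensional data onto the first two components.
X_2d = X_centered @ Vt[:2].T

print(explained_ratio)
print(X_2d)
```

The two-column result can be drawn as an ordinary scatter plot, which is one way around the difficulty of visualizing more than three variables at once.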
In the iris.csv example, we have 4 columns of measurements on which the species of iris depends. We can plot a 3D graph for 3 variables, but for more than 3 variables it is practically impossible to visualize all of them in a single plot. One way around this is to plot pair-wise graphs, i.e., to plot each variable against every other variable and then analyze the individual plots.
To plot the pair-wise graphs in Python, we can use the seaborn library:
sns.pairplot(df, hue="species", height=2)
Here it can be seen that every feature appears on both the y-axis and the x-axis. Each off-diagonal graph is a bivariate analysis between two different features. From these plots, we can find the features that are most useful for categorizing a flower into a species. In the graph of petal length against petal width, it can easily be seen that the two are positively correlated, so regression analysis can be done here. On the diagonal, where a feature would be plotted against itself (say, petal length against petal length), we instead get an estimate of the probability density function of that feature for each species. This is how the pair grid is used to understand the relationships between features when there are more than three of them. This type of analysis is the most basic analysis that can be done on the dataset.
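The pair-wise relationships seen in the plots can also be summarized numerically with a correlation matrix. The sketch below uses a small hypothetical stand-in frame (illustrative values, same column names as iris.csv); with the real data, the same call applies to the full `df`.

```python
import pandas as pd

# Stand-in rows (illustrative values) with the iris.csv column names.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa"] * 2 + ["versicolor"] * 2 + ["virginica"] * 2,
})

# Keep only the numeric columns, then compute pair-wise Pearson correlations.
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))
```

Each cell of `corr` corresponds to one panel of the pair grid, so a value close to 1 for petal length and petal width reflects the positive trend visible in that panel.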