Time Series Analysis using Python

In this lesson, we will learn to perform some basic time series analysis in Python, using both graphical and statistical techniques.

We will continue with the same DataFrame (df) that we used in the previous lesson. However, in this lesson, we will analyze only its temperature column, T (degC). So, we can make a new DataFrame from that single column of the original DataFrame as,

# Taking only one column
df_temp = pd.DataFrame(df['T (degC)'])
df_temp.index = date_time 

# Displaying top 5 rows
df_temp.head()
Time-series Climate data

We will now be working only with the df_temp DataFrame in this entire lesson.


Graphical Analysis on Time-Series Data

Data visualization is a very powerful technique for analyzing data, and there are many ways to visualize time-series data. In this lesson, we will learn about Line Plots, Histogram and Density Plots, Box Plots, and Calendar Heatmaps. For this, we will be using the Matplotlib and calmap libraries. So, before diving into creating different plots, we need to import the necessary libraries as,

import matplotlib
import matplotlib.pyplot as plt
import calmap

# Defining figure size
matplotlib.rcParams['figure.figsize'] = (12, 6)
%matplotlib inline

1. Line Plots

A line plot is a type of chart that displays information as a series of data points connected by straight lines. It is one of the most commonly used plots for visualizing time-series data. In such plots, time is represented on the x-axis. A line plot can easily be created using the pandas.DataFrame.plot function as,

# Plot all temperature
df_temp.plot()
plt.show()
Line Plot - Time Series Data Analysis

In the above graph, we plotted the entire data. However, it is sometimes necessary to plot the data for a specific time period. The following example shows how we can do that,

# Plotting for a specific time period
start_date_time = '2009-01-01 01:00:00'
finish_date_time = '2009-02-01 01:00:00'

# .loc slices the DatetimeIndex by label and includes both endpoints
data = df_temp.loc[start_date_time:finish_date_time]

data['T (degC)'].plot()
plt.show()
Line Plot 2 - Time Series Data Analysis
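
To see exactly which rows such a selection returns, we can try it on a small synthetic series. The following is a minimal sketch, assuming an hourly DatetimeIndex like the one in df_temp; the name demo and its values are made up for illustration and are not part of the lesson's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly data standing in for df_temp (values are synthetic)
index = pd.date_range("2009-01-01", periods=24 * 60, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"T (degC)": np.arange(len(index), dtype=float)}, index=index)

# Label-based slicing on a DatetimeIndex includes both the start and the end label
january = demo.loc["2009-01-01 01:00:00":"2009-02-01 01:00:00"]

# Partial string indexing selects every row of a month at once
whole_january = demo.loc["2009-01"]

print(january.index.min(), january.index.max(), len(january))
print(len(whole_january))
```

Note that the end label is part of the result, which is different from ordinary positional slicing in Python.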

Line plots are also helpful for understanding how the pattern of our data changes from one interval to the next (e.g., from year to year). For this, we can group our data by year, and then plot the data of every year as,

# Group values of different years
group_years = df_temp.groupby(df_temp.index.year)

# One column per year; concat keeps every row and pads shorter years with NaN
years = pd.concat(
    {name: pd.Series(group['T (degC)'].values) for name, group in group_years},
    axis=1,
)

# Plot data
years.plot(subplots=True, legend=True, figsize=(12,8))
plt.show()
Time Series Plot for different years

2. Histogram and Density Plot

A histogram is an approximate representation of the distribution of numerical data. Some time-series forecasting methods assume that the data follows a certain distribution (such as a bell curve, i.e., a normal distribution). So, plotting a histogram gives us a rough idea of how our data is distributed. Histograms can be plotted using the pandas.Series.hist function as,

# Getting data as pandas.Series
temp = df_temp['T (degC)']

# Plotting the Series
temp.hist()
plt.show()
Histogram plot - Time Series Data Analysis

Another plot that can provide us with a better idea about our data distribution is a Density Plot. For simplicity, it can be seen as a smoothed version of the histogram plot. It can be created using the pandas.Series.plot function with kind as 'kde' (Kernel Density Estimate) as,

temp.plot(kind='kde')
plt.show()
KDE Plot - Time Series Data Analysis

3. Box Plots

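Box plots summarize the distribution of our data through its quartiles: the box spans the first to the third quartile, a line inside the box marks the median, and the whiskers extend toward the remaining values. If you are not familiar with box plots, here is a quick anatomy of a box plot,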

Box Plot - Time Series Data Analysis
Source: Quant Girl

We can create a box plot for each year by grouping the data by year, in the same way we did for the per-year line plots,

# Group values of different years
group_years = df_temp.groupby(df_temp.index.year)

# One column per year; concat keeps every row and pads shorter years with NaN
years = pd.concat(
    {name: pd.Series(group['T (degC)'].values) for name, group in group_years},
    axis=1,
)

# Construct Box Plots
years.boxplot()
plt.show()
Time Series Box Plot

From the above box plot, we can see that the observations taken in different years lie within a similar range. This is expected, since the temperature of a given place stays within roughly the same range from year to year.
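
The numbers behind a box plot can also be computed directly. The following is a minimal sketch, assuming hourly data indexed by time; the series demo is synthetic and only stands in for the temperature column.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series covering two full years (synthetic values)
index = pd.date_range("2009-01-01", "2010-12-31 23:00", freq=pd.Timedelta(hours=1))
demo = pd.Series(np.sin(np.arange(len(index)) * 2 * np.pi / 24), index=index)

# Quartiles per calendar year: the box edges (Q1, Q3) and the median line
quartiles = demo.groupby(demo.index.year).quantile([0.25, 0.5, 0.75]).unstack()
quartiles.columns = ["Q1", "median", "Q3"]

# Interquartile range: the height of each box
quartiles["IQR"] = quartiles["Q3"] - quartiles["Q1"]
print(quartiles)
```

Comparing the rows of this table is the numerical counterpart of comparing the boxes side by side.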

4. Calendar Heatmap

A calendar heatmap is a kind of plot that shows the intensity of data over the days of a year using color gradients. In the heatmap below, a darker shade indicates a higher value. We will be using the calmap library for creating our calendar heatmap as,

import calmap

YEAR = 2010
year_data = df_temp[df_temp.index.year == YEAR]

calmap.calendarplot(year_data['T (degC)'])
plt.show()

Calendar Heatmap - Time Series Data Analysis
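
A calendar heatmap has one cell per day, whereas our readings are hourly, so several readings end up in the same cell. How they are combined depends on the plotting library's settings. If we want to control this ourselves, one option is to aggregate the data to one value per day with pandas first. The following is a minimal sketch on synthetic data; the name hourly and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings for three days (synthetic values)
index = pd.date_range("2010-01-01", periods=72, freq=pd.Timedelta(hours=1))
hourly = pd.Series(np.repeat([1.0, 3.0, 5.0], 24), index=index)

# One value per calendar day: the mean of that day's 24 readings
daily_mean = hourly.resample("D").mean()
print(daily_mean)
```

For temperature, a daily mean is usually easier to interpret than a daily sum, since a sum grows with the number of readings per day.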

Statistical Analysis on Time-Series Data

The statistical approach is another important method of analyzing time-series data. Such analysis involves computing various metrics of the series, such as the mean and the median. However, for time-series data, these metrics are usually computed over rolling windows rather than over the entire series at once.

Rolling windows split the data into time windows. The different windows created overlap and “roll” along at the same frequency as the data, so the transformed time series is at the same frequency as the original time series. Statistical metrics such as mean and median are calculated over only those observations inside the rolling windows.

Let us compute the rolling mean and median over a window of 48 observations (corresponding to 48 hours, or 2 days, of hourly readings),

df_temp['rolling_mean'] = df_temp['T (degC)'].rolling(window = 48).mean()
df_temp['rolling_median'] = df_temp['T (degC)'].rolling(window = 48).median()

df_temp.tail()
Rolling mean and median - Time Series Data Analysis
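
To see what a rolling window does in detail, we can apply it to a very short series. The following is a minimal sketch with made-up values and a window of 3 instead of 48.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Window of 3: each result averages the current value and the two before it
rolling_mean = values.rolling(window=3).mean()
print(rolling_mean.tolist())

# min_periods=1 returns a value as soon as one observation is available
partial = values.rolling(window=3, min_periods=1).mean()
print(partial.tolist())
```

With the default settings, the first window - 1 results are NaN because the window is not yet full. This is why, with a window of 48, the rolling columns are empty at the very start of the DataFrame and it is more informative to look at the last rows.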

Patterns in Time-Series Data

There can be various patterns underlying time series data.  It is often helpful to split a time series into several components, each representing an underlying pattern category. Such splitting is very helpful for Exploratory Data Analysis (EDA).

One of the most common splitting techniques is to split the data into three components: trend, seasonality, and the error terms. A trend is present when the series shows a sustained increasing or decreasing slope. Seasonality, on the other hand, is present when a distinct pattern repeats at regular intervals due to seasonal factors. It could be because of the month of the year, the day of the month, the day of the week, or even the time of day.

So, a time series may be imagined as a combination of the trend, seasonality, and error terms. We can use the statsmodels module in Python to decompose the time series into error, trend, and seasonality. The decomposition needs to know the period, i.e., the number of observations in one seasonal cycle. As we have 24 rows of data for each day (a reading is taken every hour), we will use 24 as the period so that the seasonal component captures the daily cycle.

from statsmodels.tsa.seasonal import seasonal_decompose

# period = number of observations per seasonal cycle (this argument was named freq in older statsmodels releases)
result = seasonal_decompose(df_temp['T (degC)'], model='additive', period=24)
fig = result.plot()
plt.show()
Seasonality, Trend and Residual in Time Series Data Analysis
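
The idea behind an additive decomposition can be sketched by hand with pandas. The following is a simplified illustration on synthetic data, not the statsmodels implementation: the trend is estimated with a centered moving average over one period, the seasonality as the average detrended value at each position in the cycle, and the error as whatever remains. The period of 4 and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

period = 4  # observations per seasonal cycle (hypothetical toy value)
t = np.arange(40, dtype=float)
pattern = np.array([1.0, -1.0, 2.0, -2.0])  # repeats every `period` steps, sums to zero
observed = pd.Series(0.5 * t + np.tile(pattern, len(t) // period))

# Trend: moving average over one full period, centered with a second 2-point average
trend = observed.rolling(period).mean().rolling(2).mean().shift(-(period // 2))

# Seasonality: average detrended value at each position within the cycle
phase = np.arange(len(observed)) % period
seasonal_means = (observed - trend).groupby(phase).mean()
seasonal_means = seasonal_means - seasonal_means.mean()
seasonal = pd.Series(seasonal_means.loc[phase].to_numpy(), index=observed.index)

# Error term: whatever trend and seasonality do not explain
resid = observed - trend - seasonal

print(seasonal_means.round(6).tolist())
print(resid.dropna().abs().max())
```

Because this toy series is built from an exact linear trend and an exact repeating pattern, the error term is essentially zero. On real data, the error term holds the irregular part of the series.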

Stationary and Non-Stationary Time-Series Data

A stationary time series is one whose values are not a function of time, i.e., the statistical properties of the series, such as the mean and the variance, are constant over time.

Most statistical forecasting methods are designed to work on a stationary time series. So it is generally suggested to check the series for stationarity before forecasting.
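
A quick informal way to build intuition for stationarity is to compare summary statistics over different parts of a series. The following is a minimal sketch on synthetic data (random noise with and without an added trend); it is only an illustration and not a substitute for a formal test.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
noise = rng.normal(size=200)

stationary = pd.Series(noise)                              # statistics do not depend on time
trending = pd.Series(noise + 0.1 * np.arange(len(noise)))  # mean drifts upward over time

def half_means(series):
    """Return the mean of the first half and of the second half of a series."""
    half = len(series) // 2
    return series.iloc[:half].mean(), series.iloc[half:].mean()

s_first, s_second = half_means(stationary)
t_first, t_second = half_means(trending)
print("stationary:", round(s_first, 2), round(s_second, 2))
print("trending:  ", round(t_first, 2), round(t_second, 2))
```

For the noise series, the two halves have similar means, while for the trending series the mean of the second half is clearly larger, which indicates that its mean depends on time.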

How to test for stationarity of time series data?

The stationarity of a time series can be checked using a family of statistical tests called ‘Unit Root Tests’. There are different variants of the unit root test. In this lesson, we will cover one of the most commonly used variants, the Augmented Dickey-Fuller test (ADF test).

The ADF test assumes two hypotheses,

  • Null Hypothesis: The series has a unit root, i.e., it is non-stationary.
  • Alternative Hypothesis: The series has no unit root, i.e., it is stationary.

Then, the p-value is computed and checked against the significance level (0.05). If the p-value of the ADF test is less than or equal to the significance level (0.05), we reject the null hypothesis and conclude that the series is stationary. Otherwise, we fail to reject the null hypothesis and treat the series as non-stationary.

The following block of code illustrates how we can perform the ADF test using Python,

from statsmodels.tsa.stattools import adfuller


def adf_check(time_series):
    result = adfuller(time_series)
    print("Augmented Dickey-Fuller Test")
    labels = ['ADF Test Statistic', 'p-value', '# of lags', '# of observations used']
    
    # adfuller returns more than four values; zip stops after the four labelled ones
    for value, label in zip(result, labels):
        print(label + " : " + str(value))
        
    if result[1] <= 0.05:
        print("Strong evidence against null hypothesis.\nReject Null Hypothesis.\nData has no unit root and is stationary.")
    else:
        print("Weak evidence against null hypothesis.\nFail to reject Null Hypothesis.\nData has a unit root and is non-stationary.")

# Performing adf test for the first 10,000 rows of data (to save computational time)
adf_check(df_temp['T (degC)'][:10000])
Augmented Dickey-Fuller Test
ADF Test Statistic : -3.4650592619838014
p-value : 0.008931120463794082
# of lags : 38
# of observations used : 9961
Strong evidence against null hypothesis.
Reject Null Hypothesis.
Data has no unit root and is stationary.

How to make a series stationary?

In the above example, our series was stationary. But what if the series were non-stationary? In such cases, we generally transform the series into a stationary one.

One of the most common approaches for making a series stationary is Differencing. In this method, we compute the difference of consecutive terms in the series. It is typically performed to get rid of the varying mean.

If Y_t is the value at time t, then the first difference of the series at time t is Y_t - Y_{t-1}. In simpler terms, differencing a series is nothing but subtracting each value from the value that follows it.

For example, consider the following series: [5, 8, 2, 1, 10].

We can perform differencing on the series by subtracting the current value from the next value as,

[8-5, 2-8, 1-2, 10-1] = [3, -6, -1, 9]
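
The same computation can be done with pandas. The following is a minimal sketch using the pandas.Series.diff function on the example series above.

```python
import pandas as pd

series = pd.Series([5, 8, 2, 1, 10])

# First difference: each value minus the one before it
first_diff = series.diff()
print(first_diff.tolist())           # [nan, 3.0, -6.0, -1.0, 9.0]

# The first entry has no predecessor, so it is NaN and is usually dropped
print(first_diff.dropna().tolist())  # [3.0, -6.0, -1.0, 9.0]
```

The differenced series is one element shorter than the original, which is worth keeping in mind when aligning it with other columns.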

With this, we have come to the end of this lesson. Head on to the next lesson on 'Creating Helper Functions' to start working on building our forecasting model.


Written by
The Click Reader