Analyzing Time-Series Data
November 29, 2020 2020-12-03 21:16
In the previous lesson, we got familiar with time-series data. In this lesson, we will learn to perform some basic analysis, both statistical and graphical, on time-series data. We will continue with the same DataFrame (df) used in the previous lesson, but we will analyze only its temperature column, T (degC). We can build a new DataFrame from that single column as,
# Taking only one column
df_temp = pd.DataFrame(df['T (degC)'])
df_temp.index = date_time

# Displaying top 5 rows
df_temp.head()
We will be working only with the df_temp DataFrame for the rest of this lesson.
Graphical Analysis on Time-Series Data
Data visualization is a very powerful technique for analyzing data, and there are many ways to visualize time-series data. In this lesson, we will learn about Line Plots, Histogram and Density Plots, Box Plots, and Calendar Heatmaps. For this, we will be using the Matplotlib and calmap libraries. So before diving into creating different plots, we need to import the necessary libraries as,
import matplotlib
import matplotlib.pyplot as plt
import calmap

# Defining figure size
matplotlib.rcParams['figure.figsize'] = (12, 6)
%matplotlib inline
1. Line Plots
A line plot is a type of chart that displays information as a series of data points connected by straight lines. It is one of the most commonly used plots for visualizing time-series data. In such plots, time is represented on the x-axis. A line plot can easily be created using the pandas.DataFrame.plot function as,
# Plot all temperature
df_temp.plot()
plt.show()
In the above graph, we plotted the entire data. However, it is sometimes necessary to plot the data for a specific time period. The following example shows how we can do that,
# Plotting for a specific time period
start_date_time = '2009-01-01 01:00:00'
finish_date_time = '2009-02-01 01:00:00'
data = df_temp[start_date_time:finish_date_time]
data['T (degC)'].plot()
Line plots also help us see how the pattern in our data changes from one interval to the next (e.g., from year to year). For this, we can group our data by year, and then plot the data of every year as,
# Group values of different years
group_years = df_temp.groupby(df_temp.index.year)
years = pd.DataFrame()
for name, group in group_years:
    values = group['T (degC)'].values
    years[name] = pd.Series(values)

# Plot data
years.plot(subplots=True, legend=True, figsize=(12,8))
plt.show()
2. Histogram and Density Plot
A histogram is an approximate representation of the distribution of numerical data. Some time-series forecasting methods assume that the data follows a certain distribution (such as a bell curve, i.e., a normal distribution), so plotting a histogram gives us a rough idea of whether that assumption is plausible. Histograms can be plotted using the pandas.Series.hist function as,
# Getting data as pandas.Series
temp = df_temp['T (degC)']

# Plotting the Series
temp.hist()
plt.show()
Another plot that can give us a better idea of our data distribution is a Density Plot. For simplicity, it can be seen as a smoothed version of the histogram. It can be created using the pandas.Series.plot function with kind set to 'kde' (Kernel Density Estimate).
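The code for this step is not shown above, so here is a minimal sketch of what it might look like. It uses synthetic, roughly bell-shaped values in place of df_temp['T (degC)'] (an assumption made so the snippet runs on its own), and pandas needs SciPy installed to compute the kernel density estimate.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for df_temp['T (degC)'] (assumption: roughly bell-shaped values)
rng = np.random.default_rng(42)
temp = pd.Series(rng.normal(loc=9.0, scale=8.0, size=1000), name="T (degC)")

# Density plot: a smoothed counterpart of temp.hist()
fig, ax = plt.subplots()
temp.plot(kind="kde", ax=ax)
ax.set_xlabel("T (degC)")
# In a notebook the figure renders inline; in a script, call plt.show()
```

With the lesson's data, the same call would simply be made on the temp Series created for the histogram.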
3. Box Plots
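Box plots summarize the distribution of our data with a handful of statistics. If you are not familiar with box plots, here is a quick anatomy of one: the box spans the first quartile (Q1) to the third quartile (Q3), so its height is the interquartile range (IQR); the line inside the box marks the median; the whiskers extend to the most extreme observations within 1.5 times the IQR of the box (the Matplotlib default); and points beyond the whiskers are drawn individually as outliers.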
We can create a box plot for each year by grouping the data by year, in the same way we did for the yearly line plots,
# Group values of different years
group_years = df_temp.groupby(df_temp.index.year)
years = pd.DataFrame()
for name, group in group_years:
    values = group['T (degC)'].values
    years[name] = pd.Series(values)

# Construct Box Plots
years.boxplot()
plt.show()
From the above box plot, we can see that the observations from different years lie within a similar range. This is what we would expect, since the temperature at a given place stays within roughly the same range every year.
4. Calendar Heatmap
A calendar heatmap is a kind of plot that shows the intensity of data over the days of a year using color gradients. With the default color map, a darker shade indicates a higher value. We will be using the calmap library for creating our calendar heatmap as,
import calmap

YEAR = 2010
year_data = df_temp[df_temp.index.year == YEAR]
calmap.calendarplot(year_data['T (degC)'])
plt.show()
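A calendar heatmap has one cell per day, while our readings are hourly, so the values have to be reduced to one number per day at some point. calmap handles this internally through its how argument (which, to my knowledge, defaults to summing the values of each day; treat that as an assumption and check your installed version). For temperature, a daily mean is usually easier to interpret than a daily sum, so one option is to aggregate the series yourself first. The sketch below shows the difference on synthetic hourly data that stands in for df_temp['T (degC)'].

```python
import numpy as np
import pandas as pd

# Synthetic hourly readings for three days (stand-in for df_temp['T (degC)'])
index = pd.date_range("2010-01-01", periods=72, freq=pd.Timedelta(hours=1))
hourly = pd.Series(np.repeat([1.0, 2.0, 3.0], 24), index=index, name="T (degC)")

# Reduce the 24 readings of each calendar day to a single value
daily_mean = hourly.resample("D").mean()
daily_sum = hourly.resample("D").sum()

print(daily_mean.tolist())  # [1.0, 2.0, 3.0]
print(daily_sum.tolist())   # [24.0, 48.0, 72.0]
```

A pre-aggregated series such as daily_mean could then be passed to calmap.calendarplot in place of the hourly series.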
Statistical Analysis on Time-Series Data
The statistical approach is another important method of analyzing time-series data. Such analysis involves computing various metrics of the series such as the mean, the median, etc. However, in time-series data, we use the concept of rolling windows for computing these values.
Rolling windows split the data into time windows. The different windows created overlap and “roll” along at the same frequency as the data, so the transformed time series is at the same frequency as the original time series. Statistical metrics such as mean and median are calculated over only those observations inside the rolling windows.
Let us compute the rolling mean and median over a window size of 48 (corresponding to 48 hours, i.e., 2 days of observations),
df_temp['rolling_mean'] = df_temp['T (degC)'].rolling(window = 48).mean()
df_temp['rolling_median'] = df_temp['T (degC)'].rolling(window = 48).median()
df_temp.tail()
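To see what a rolling window does row by row, it can help to try it on a tiny series first. The sketch below uses a made-up five-value series and a window of 3 (both are assumptions for illustration). Note that the first window - 1 results are NaN because those windows are not yet full, which is why looking at tail() is more informative than head() for the columns computed above.

```python
import pandas as pd

# Toy series (illustrative values, not from the lesson's dataset)
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each result is the mean of the current value and the two before it
rolling_mean = values.rolling(window=3).mean()
print(rolling_mean.tolist())  # [nan, nan, 2.0, 3.0, 4.0]

# min_periods=1 lets incomplete windows at the start produce a value too
partial_mean = values.rolling(window=3, min_periods=1).mean()
print(partial_mean.tolist())  # [1.0, 1.5, 2.0, 3.0, 4.0]
```

The output has the same length and index as the input, which matches the description of rolling windows above.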
Patterns in Time-Series Data
There can be various patterns underlying time series data. It is often helpful to split a time series into several components, each representing an underlying pattern category. Such splitting is very helpful for Exploratory Data Analysis (EDA).
One of the most common splitting techniques is to split the data into three different components: trend, seasonality, and the error term. A trend is present when the time series shows an increasing or decreasing slope. Seasonality is present when a distinct pattern repeats at regular intervals due to seasonal factors, such as the month of the year, the day of the month, the day of the week, or even the time of day.
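So, a time series may be imagined as a combination of the trend, seasonality, and error terms. We can use the statsmodels module in Python to decompose the time series into these three components. The reading is taken every hour, so there are 24 rows of data per day; we divide the length of the DataFrame by 24 and pass the result as the seasonal period of the decomposition (the argument is named freq in older statsmodels releases and period in current ones). Keep in mind that this argument is the number of observations in one seasonal cycle, so it directly determines which repeating pattern ends up in the seasonal component.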
from statsmodels.tsa.seasonal import seasonal_decompose

# `period` is the current name of the argument that older statsmodels releases called `freq`
result = seasonal_decompose(df_temp['T (degC)'], model='additive', period=len(df_temp)//24)
fig = result.plot()
plt.show()
Stationary and Non-Stationary Time-Series Data
A stationary time series is one whose values are not a function of time, i.e., the statistical properties of the series, such as its mean and variance, are constant over time.
Most statistical forecasting methods are designed to work on a stationary time series. So it is generally suggested to check the series for stationarity before forecasting.
How to test for stationarity of a time series data?
The stationarity of a time series can be checked using a family of statistical tests called Unit Root Tests. There are different variants of unit root tests. In this lesson, we will cover one of the most commonly used variants, the Augmented Dickey-Fuller test (ADF test).
The ADF test sets up two hypotheses:
- Null Hypothesis: The series has a unit root, i.e., it is not stationary.
- Alternative Hypothesis: The series has no unit root, i.e., it is stationary.
Then, the p-value is computed and compared with the significance level (0.05). If the p-value of the ADF test is less than the significance level, we reject the null hypothesis and conclude that the series is stationary; otherwise, we fail to reject it and treat the series as non-stationary.
The following block of code illustrates how we can perform the ADF test using Python,
from statsmodels.tsa.stattools import adfuller

def adf_check(time_series):
    result = adfuller(time_series)
    print("Augmented Dicky-Fuller Test")
    labels = ['ADF Test Statistic', 'p-value', '# of lags', '# of observations used']
    for value, label in zip(result, labels):
        print(label + " : " + str(value))
    # result[1] is the p-value
    if result[1] <= 0.05:
        print("Strong evidence against null hypothesis.\nReject Null Hypothesis.\nData has no unit root and is stationary.")
    else:
        print("Weak evidence against null hypothesis.\nFail to reject Null Hypothesis.\nData has a unit root and is non-stationary.")

# Performing adf test for the first 10,000 rows of data (to save computational time)
adf_check(df_temp['T (degC)'][:10000])
Augmented Dicky-Fuller Test
ADF Test Statistic : -3.4650592619838014
p-value : 0.008931120463794082
# of lags : 38
# of observations used : 9961
Strong evidence against null hypothesis.
Reject Null Hypothesis.
Data has no unit root and is stationary.
How to make a series stationary?
In the above example, our series was stationary. But what if the series were non-stationary? In such cases, we generally transform the series into a stationary one.
One of the most common approaches for making a series stationary is Differencing. In this method, we compute the difference of consecutive terms in the series. It is typically performed to get rid of the varying mean.
If Y_t is the value at time t, then the first difference of Y at time t is Y_t - Y_{t-1}. In simpler terms, differencing the series is nothing but subtracting from each value the value that precedes it.
For example, consider the following series: [5, 8, 2, 1, 10].
We can perform differencing on the series by subtracting each value from the one that follows it as,
[8-5, 2-8, 1-2, 10-1] = [3, -6, -1, 9]
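In pandas, the same computation can be done with Series.diff(). The sketch below reproduces the example above; the variable names are only illustrative. The first entry of the result is NaN because the first value has no predecessor to subtract.

```python
import pandas as pd

# The example series from the text
series = pd.Series([5, 8, 2, 1, 10])

# First difference: each value minus the one before it
first_diff = series.diff()
print(first_diff.tolist())  # [nan, 3.0, -6.0, -1.0, 9.0]

# Dropping the leading NaN leaves the four differences computed by hand
print(first_diff.dropna().astype(int).tolist())  # [3, -6, -1, 9]
```

For the lesson's data, the equivalent call would be df_temp['T (degC)'].diff(), after which the stationarity test can be repeated on the differenced values.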
With this, we have come to the end of this lesson. Head on to the next lesson, 'Creating Helper Functions', to start working on building our forecasting model.