Before we dive into time-series forecasting with TensorFlow 2.0, we first need to become familiar with time-series data and how to manipulate it using Python and TensorFlow.
In this lesson, we will learn to perform data pre-processing, data visualization, feature engineering, and training/validation/testing splits on time-series data.
1. Importing necessary libraries
First, let us import some essential Python libraries that will be used later in this chapter for manipulating time-series data.
import os
import datetime

# For data manipulation
import numpy as np
import pandas as pd

# For data visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# For building the model and loading the dataset
import tensorflow as tf

# Set basic configurations
mpl.rcParams['figure.figsize'] = (8, 6)
mpl.rcParams['axes.grid'] = False
%matplotlib inline
Now that we’ve imported the necessary libraries, let us import and visualize the dataset.
2. Importing and visualizing the dataset
The dataset that we will be using for this tutorial is the Jena climate time-series dataset, which contains 14 different features such as air temperature, atmospheric pressure, and humidity. The observations are recorded at 10-minute intervals.
We will be using the tf.keras.utils.get_file() utility to download the dataset archive and extract the CSV file from it.
zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip',
    fname='jena_climate_2009_2016.csv.zip',
    extract=True)
csv_path, _ = os.path.splitext(zip_path)
Depending on your internet connection, the download may take about 20-60 seconds. After downloading, let us load the data and have a look at its first five rows.
# Reading in the dataset
df = pd.read_csv(csv_path)

# Looking at the first five rows of the DataFrame
df.head()
As we can see in the Date Time column, the data is recorded at 10-minute intervals. However, to make this tutorial easier to digest in a single go, we will sub-sample the data to 1-hour intervals instead.
# Slice [start:stop:step]: starting from index 5, take every 6th record
df = df[5::6]

# Store the datetime values in a separate variable for future processing
date_time = pd.to_datetime(df.pop('Date Time'), format='%d.%m.%Y %H:%M:%S')

# Looking at the first five rows of the DataFrame
df.head()
As we can see, we have successfully sub-sampled our original dataset. You can apply a similar technique to sub-sample other time-series data you may encounter in the future; for data indexed by timestamps, pandas also offers a built-in alternative, sketched below.
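As a minimal sketch (not needed for this tutorial), here is how the DataFrame.resample() method can down-sample time-indexed data. The raw_df below is a hypothetical stand-in for any DataFrame whose index is a DatetimeIndex; note that resample() aggregates each window (here by its mean) rather than picking every 6th record as we did above.

# Hypothetical example: down-sampling a time-indexed DataFrame with resample()
# `raw_df` stands in for any DataFrame whose index is a DatetimeIndex
rng = pd.date_range('2009-01-01', periods=12, freq='10min')  # two hours of 10-minute data
raw_df = pd.DataFrame({'T (degC)': range(12)}, index=rng)

# Aggregate each 1-hour window by its mean
hourly_df = raw_df.resample('1H').mean()
print(hourly_df)

With the data at hourly resolution, let us now visualize how some of the features look in relation to time.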
plot_cols = ['T (degC)', 'p (mbar)', 'rho (g/m**3)']  # Columns we want to plot
plot_features = df[plot_cols]    # Getting the columns
plot_features.index = date_time  # Setting the index to the datetime values
_ = plot_features.plot(subplots=True)
Since our dataset has a huge number of rows, the line plots drawn above can be difficult to read. To take a closer look at the data, we can plot only the first few data points, as shown below.
# Taking only the first 480 points (480 hours = 20 days)
plot_features = df[plot_cols][:480]
plot_features.index = date_time[:480]

# Plotting
_ = plot_features.plot(subplots=True)
3. Cleaning time-series data
Now, we are going to clean the time-series data. For that, let us have a look at the statistical values of the dataset that we are working with.
df.describe().transpose()
As we can see, the minimum value of the wind velocity columns, wv (m/s) and max. wv (m/s), is -9999, which is clearly an error value rather than a real wind speed. Let us replace it with zeroes.
# Getting indices of wv and max. wv with the value -9999
bad_wv = df['wv (m/s)'] == -9999.0
bad_max_wv = df['max. wv (m/s)'] == -9999.0

# Replacing the incorrect values with 0.0
df.loc[bad_wv, 'wv (m/s)'] = 0.0
df.loc[bad_max_wv, 'max. wv (m/s)'] = 0.0

# Checking that the above in-place edits are reflected in the DataFrame
df['wv (m/s)'].min()
0.0
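As a quick extra sanity check (not part of the original steps above), we can confirm that neither wind column still contains the -9999 sentinel:

# Both minimums should now be valid wind speeds (0.0 or greater)
df[['wv (m/s)', 'max. wv (m/s)']].min()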
4. Feature engineering on time-series data
To build an accurate model, we should spend some time on feature engineering by converting the data into appropriate formats. In this section, we will learn how to perform feature engineering on time-series data.
Let us convert the wind direction and wind velocity columns to a vector with x and y components. The direction column, wd (deg), stores angles in degrees, which make a poor model input: 0° and 360° describe the same direction yet are numerically far apart. Wind vectors avoid this discontinuity.
wv = df.pop('wv (m/s)')
max_wv = df.pop('max. wv (m/s)')

# Convert to radians
wd_rad = df.pop('wd (deg)') * np.pi / 180

# Calculate the wind x and y components
df['Wx'] = wv * np.cos(wd_rad)
df['Wy'] = wv * np.sin(wd_rad)

# Calculate the max wind x and y components
df['max Wx'] = max_wv * np.cos(wd_rad)
df['max Wy'] = max_wv * np.sin(wd_rad)
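To see the effect of this transformation, we can visualize the distribution of the new wind vectors. A 2D histogram, as sketched below, is one simple way to do this; the bin count and color limit are just reasonable defaults, not prescribed values.

# Visualizing the joint distribution of the wind x and y components
plt.figure()
plt.hist2d(df['Wx'], df['Wy'], bins=(50, 50), vmax=400)
plt.colorbar()
plt.xlabel('Wind X [m/s]')
plt.ylabel('Wind Y [m/s]')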
Similarly, we can convert the Date Time feature into multiple features. A raw timestamp in seconds is not a useful model input on its own, but weather data has strong daily and yearly periodicity, which we can expose to the model with sine and cosine transforms of the timestamp.
# Convert the datetimes to seconds since the Unix epoch
timestamp_s = date_time.map(datetime.datetime.timestamp)

# Lengths of the daily and yearly cycles, in seconds
day = 24 * 60 * 60
year = 365.2425 * day

df['Day sin'] = np.sin(timestamp_s * (2 * np.pi / day))
df['Day cos'] = np.cos(timestamp_s * (2 * np.pi / day))
df['Year sin'] = np.sin(timestamp_s * (2 * np.pi / year))
df['Year cos'] = np.cos(timestamp_s * (2 * np.pi / year))
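We can peek at the new time-of-day signals to confirm they cycle once per day. Since each row is one hour apart after sub-sampling, the first 25 rows cover a full day:

# Plotting one full day of the 'Day sin' and 'Day cos' signals
plt.plot(np.array(df['Day sin'])[:25])
plt.plot(np.array(df['Day cos'])[:25])
plt.xlabel('Time [h]')
plt.title('Time-of-day signals')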
5. Splitting and normalizing time-series data
To evaluate the predictions of our forecasting model, we need to split the data into training, validation, and testing sets. We will perform a 70:20:10 split: the first 70% of the data will be our training set, the next 20% our validation set, and the final 10% our testing set. Note that the splits are made in time order, without shuffling, so the validation and test sets cover periods the model never saw during training.
# Dictionary of column names and their indices, i.e., assigning indices to column names
column_indices = {name: i for i, name in enumerate(df.columns)}

# Number of rows
n = len(df)

# Splitting the dataset with a 70:20:10 split
train_df = df[0:int(n*0.7)]         # From 0% to 70%
val_df = df[int(n*0.7):int(n*0.9)]  # From 70% to 90%
test_df = df[int(n*0.9):]           # Above 90%

# Number of features in our dataset
num_features = df.shape[1]
print(f'Total number of features: {num_features}')
Total number of features: 19
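As an optional check (not part of the original steps), we can also confirm that the split proportions come out as intended:

# Confirming the 70:20:10 split proportions
for name, split in [('train', train_df), ('val', val_df), ('test', test_df)]:
    print(f'{name}: {len(split)} rows ({len(split) / n:.0%})')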
Next, let us normalize the data using the mean and standard deviation of the training dataset. We compute these statistics only on the training set so that no information from the validation and test sets leaks into the model.
train_mean = train_df.mean()
train_std = train_df.std()

train_df = (train_df - train_mean) / train_std
val_df = (val_df - train_mean) / train_std
test_df = (test_df - train_mean) / train_std
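To verify the normalization, a quick optional check is that every training feature now has a mean close to 0 and a standard deviation close to 1:

# The 'mean' column should be ~0 and the 'std' column ~1 for every feature
train_df.describe().transpose()[['mean', 'std']]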
A violin plot will help us understand the distribution of each normalized feature.
df_std = (df - train_mean) / train_std
df_std = df_std.melt(var_name='Column', value_name='Normalized')

plt.figure(figsize=(12, 6))
ax = sns.violinplot(x='Column', y='Normalized', data=df_std)
_ = ax.set_xticklabels(df.keys(), rotation=90)
Thus, in this chapter, we became familiar with data pre-processing, data visualization, feature engineering, and training/validation/testing splits on time-series data. In the next lesson, 'Time-series Analysis using Python', you will be introduced to some basics of time-series forecasting.