Before we dive into time-series forecasting with TensorFlow 2.0, we first need to become familiar with time-series data and how to manipulate it using Python and TensorFlow.
In this lesson, we will learn to perform data pre-processing, data visualization, feature engineering, and training/validation/testing splits on time-series data.
1. Importing necessary libraries
First, let us import some essential Python libraries that will be used later in this chapter for manipulating time-series data.
import os
import datetime

# For data manipulation
import numpy as np
import pandas as pd

# For data visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# For building the model and loading the dataset
import tensorflow as tf

# Set basic configurations
mpl.rcParams['figure.figsize'] = (8, 6)
mpl.rcParams['axes.grid'] = False
%matplotlib inline
Now that we’ve imported the necessary libraries, let us import and visualize the dataset.
2. Importing and visualizing the dataset
The dataset that we will be using for this tutorial is the Jena climate time-series dataset, which contains 14 different features such as air temperature, atmospheric pressure, and humidity. The observations are recorded at 10-minute intervals.
We will be using the tf.keras.utils.get_file() utility to download the dataset archive and extract the CSV file from it.
zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip',
    fname='jena_climate_2009_2016.csv.zip',
    extract=True)
csv_path, _ = os.path.splitext(zip_path)
Depending on your internet connection, the download may take about 20-60 seconds. After downloading, let us load the data and have a look at its first five rows.
# Reading in the dataset
df = pd.read_csv(csv_path)

# Looking at the first five rows of the DataFrame
df.head()
As we can see in the Date Time column, the data is recorded at 10-minute intervals. However, to make this tutorial easier to digest in a single go, we will sub-sample the data to 1-hour intervals instead.
# Slice [start:stop:step]: starting from index 5, take every 6th record
df = df[5::6]

# Store the datetime values in a separate variable for future processing
date_time = pd.to_datetime(df.pop('Date Time'), format='%d.%m.%Y %H:%M:%S')

# Looking at the first five rows of the DataFrame
df.head()
As we can see, we have successfully sub-sampled our original dataset. You can apply a similar technique to sub-sample other time-series data you may encounter in the future; for data indexed by timestamps, pandas also offers a built-in alternative, sketched below.
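As a minimal sketch (not needed for this tutorial), here is how the DataFrame.resample() method can down-sample time-indexed data. The raw_df below is a hypothetical stand-in for any DataFrame whose index is a DatetimeIndex; note that resample() aggregates each window (here by its mean) rather than picking every 6th record as we did above.

# Hypothetical example: down-sampling a time-indexed DataFrame with resample()
# `raw_df` stands in for any DataFrame whose index is a DatetimeIndex
rng = pd.date_range('2009-01-01', periods=12, freq='10min')  # two hours of 10-minute data
raw_df = pd.DataFrame({'T (degC)': range(12)}, index=rng)

# Aggregate each 1-hour window by its mean
hourly_df = raw_df.resample('1H').mean()
print(hourly_df)

With the data at hourly resolution, let us now visualize how some of the features look in relation to time.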
plot_cols = ['T (degC)', 'p (mbar)', 'rho (g/m**3)']  # Columns we want to plot
plot_features = df[plot_cols]    # Getting the columns
plot_features.index = date_time  # Setting the index to the datetime values
_ = plot_features.plot(subplots=True)
Since our dataset has a huge number of rows, the line plots drawn above can be difficult to read. To take a closer look at the data, we can plot only the first few data points, as shown below.
# Taking only the first 480 points (480 hours = 20 days)
plot_features = df[plot_cols][:480]
plot_features.index = date_time[:480]

# Plotting
_ = plot_features.plot(subplots=True)
3. Cleaning time-series data
Now, we are going to clean the time-series data. For that, let us have a look at the statistical values of the dataset that we are working with.
df.describe().transpose()
As we can see, the minimum value of the wind velocity columns, wv (m/s) and max. wv (m/s), is -9999, which is clearly an error value rather than a real wind speed. Let us replace it with zeroes.
# Getting indices of wv and max. wv with the value -9999
bad_wv = df['wv (m/s)'] == -9999.0
bad_max_wv = df['max. wv (m/s)'] == -9999.0

# Replacing the incorrect values with 0.0
df.loc[bad_wv, 'wv (m/s)'] = 0.0
df.loc[bad_max_wv, 'max. wv (m/s)'] = 0.0

# Checking that the above in-place edits are reflected in the DataFrame
df['wv (m/s)'].min()
0.0
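As a quick extra sanity check (not part of the original steps above), we can confirm that neither wind column still contains the -9999 sentinel:

# Both minimums should now be valid wind speeds (0.0 or greater)
df[['wv (m/s)', 'max. wv (m/s)']].min()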
4. Feature engineering on time-series data
To build an accurate model, we should spend some time on feature engineering by converting the data into appropriate formats. In this section, we will learn how to perform feature engineering on time-series data.
Let us convert the wind direction and wind velocity columns to a vector with x and y components. The direction column, wd (deg), stores angles in degrees, which make a poor model input: 0° and 360° describe the same direction yet are numerically far apart. Wind vectors avoid this discontinuity.
wv = df.pop('wv (m/s)')
max_wv = df.pop('max. wv (m/s)')

# Convert to radians
wd_rad = df.pop('wd (deg)') * np.pi / 180

# Calculate the wind x and y components
df['Wx'] = wv * np.cos(wd_rad)
df['Wy'] = wv * np.sin(wd_rad)

# Calculate the max wind x and y components
df['max Wx'] = max_wv * np.cos(wd_rad)
df['max Wy'] = max_wv * np.sin(wd_rad)
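To see the effect of this transformation, we can visualize the distribution of the new wind vectors. A 2D histogram, as sketched below, is one simple way to do this; the bin count and color limit are just reasonable defaults, not prescribed values.

# Visualizing the joint distribution of the wind x and y components
plt.figure()
plt.hist2d(df['Wx'], df['Wy'], bins=(50, 50), vmax=400)
plt.colorbar()
plt.xlabel('Wind X [m/s]')
plt.ylabel('Wind Y [m/s]')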
Similarly, we can convert the Date Time feature into multiple features. A raw timestamp in seconds is not a useful model input on its own, but weather data has strong daily and yearly periodicity, which we can expose to the model with sine and cosine transforms of the timestamp.
# Convert the datetimes to seconds since the Unix epoch
timestamp_s = date_time.map(datetime.datetime.timestamp)

# Lengths of the daily and yearly cycles, in seconds
day = 24 * 60 * 60
year = 365.2425 * day

df['Day sin'] = np.sin(timestamp_s * (2 * np.pi / day))
df['Day cos'] = np.cos(timestamp_s * (2 * np.pi / day))
df['Year sin'] = np.sin(timestamp_s * (2 * np.pi / year))
df['Year cos'] = np.cos(timestamp_s * (2 * np.pi / year))
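We can peek at the new time-of-day signals to confirm they cycle once per day. Since each row is one hour apart after sub-sampling, the first 25 rows cover a full day:

# Plotting one full day of the 'Day sin' and 'Day cos' signals
plt.plot(np.array(df['Day sin'])[:25])
plt.plot(np.array(df['Day cos'])[:25])
plt.xlabel('Time [h]')
plt.title('Time-of-day signals')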
5. Splitting and normalizing time-series data
To evaluate the predictions of our forecasting model, we need to split the data into training, validation, and testing sets. We will perform a 70:20:10 split: the first 70% of the data will be our training set, the next 20% our validation set, and the final 10% our testing set. Note that the splits are made in time order, without shuffling, so the validation and test sets cover periods the model never saw during training.
# Dictionary of column names and their indices, i.e., assigning indices to column names
column_indices = {name: i for i, name in enumerate(df.columns)}

# Number of rows
n = len(df)

# Splitting the dataset with a 70:20:10 split
train_df = df[0:int(n*0.7)]         # From 0% to 70%
val_df = df[int(n*0.7):int(n*0.9)]  # From 70% to 90%
test_df = df[int(n*0.9):]           # Above 90%

# Number of features in our dataset
num_features = df.shape[1]
print(f'Total number of features: {num_features}')
Total number of features: 19
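As an optional check (not part of the original steps), we can also confirm that the split proportions come out as intended:

# Confirming the 70:20:10 split proportions
for name, split in [('train', train_df), ('val', val_df), ('test', test_df)]:
    print(f'{name}: {len(split)} rows ({len(split) / n:.0%})')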
Next, let us normalize the data using the mean and standard deviation of the training dataset. We compute these statistics only on the training set so that no information from the validation and test sets leaks into the model.
train_mean = train_df.mean()
train_std = train_df.std()

train_df = (train_df - train_mean) / train_std
val_df = (val_df - train_mean) / train_std
test_df = (test_df - train_mean) / train_std
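To verify the normalization, a quick optional check is that every training feature now has a mean close to 0 and a standard deviation close to 1:

# The 'mean' column should be ~0 and the 'std' column ~1 for every feature
train_df.describe().transpose()[['mean', 'std']]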
A violin plot will help us understand the distribution of each normalized feature.
df_std = (df - train_mean) / train_std
df_std = df_std.melt(var_name='Column', value_name='Normalized')

plt.figure(figsize=(12, 6))
ax = sns.violinplot(x='Column', y='Normalized', data=df_std)
_ = ax.set_xticklabels(df.keys(), rotation=90)
Thus, in this chapter, we became familiar with data pre-processing, data visualization, feature engineering, and training/validation/testing splits on time-series data. In the next lesson, 'Time-series Analysis using Python', you will be introduced to some basics of time-series forecasting.