Time Series Tips and Tricks

Time Series Forecasting using Deep Learning with TensorFlow

In this tutorial we will walk through the code for time series prediction with TensorFlow and Deep Learning on a practical dataset. We will use a custom Deep Neural Network to solve a univariate time series problem. As example data, we will use the Sunspot Data from Kaggle, which can be downloaded from my GitHub project TimeSeries-Using-TensorFlow. I encourage everyone to use Google Colab notebooks, where the required modules are already installed and the infrastructure is ready to use. Now, let’s begin!

Step 1 – Downloading and loading the data

You can download the data with a single command:

!wget --no-check-certificate https://raw.githubusercontent.com/adib0073/TimeSeries-Using-TensorFlow/main/Data/Sunspots.csv -O /tmp/sunspots.csv

Once the data is downloaded and saved under /tmp, we use pandas to load it as a DataFrame.

import pandas as pd

# Loading the data as a pandas dataframe
df = pd.read_csv('/tmp/sunspots.csv', index_col=0)
df.head()

Usually, the better approach after loading the data is to perform an extensive EDA. In this case we will skip the full EDA and simply visualize the data to look for any trend, seasonality or other typical pattern in the time series.

From the initial visualization, we don’t see any clear upward or downward trend, but we do see a seasonal or cyclic pattern in the data. I would still recommend doing an extensive EDA to understand or estimate the seasonal period of the data. That estimate gives us intuition for choosing the window size and can even help determine the train-test split ratio.
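One simple way to estimate a seasonal period is to look for the lag at which the series correlates most strongly with itself. The sketch below is only an illustration on a synthetic sine wave, not on the sunspot data; the helper names `autocorrelation` and `estimate_period` and the lag range are assumptions made for this example.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var


def estimate_period(series, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the highest autocorrelation."""
    return max(range(min_lag, max_lag + 1), key=lambda lag: autocorrelation(series, lag))


# Synthetic stand-in for a cyclic series: a clean cycle that repeats every 11 steps
toy_series = [math.sin(2 * math.pi * t / 11) for t in range(220)]
toy_period = estimate_period(toy_series, min_lag=2, max_lag=40)
print(toy_period)
```

On real, noisy data the peak is usually less sharp, so it is worth plotting the autocorrelation over a range of lags rather than trusting a single maximum.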

Step 2 – Data Preparation

In the data preparation part, we need to transform the data into a format that can be fed to a Deep Neural Network (DNN) model for training. So, you might be wondering what the typical pre-processing steps are. A time series forecast can be treated as a sequential machine learning regression problem, in which the sequence data is converted into rows of feature values and a corresponding target (or true) value. The target value is required because regression is a supervised learning problem, and the lagged time series values become the feature values.

To give an example, if we have the series of square numbers 1, 4, 9, 16, 25, 36 and 49 and we want a machine learning model to predict the next number after 49, how would we do that?

We need to choose a window size and move the window from left to right across the series. The value immediately to the right of the window is the target or true value. Each time we shift the window, we get a new pair of feature values and target value. In this way we form the training data and training labels. The test and validation datasets, which a machine learning prediction model typically requires, are formed the same way. The train-test-validation split ratio depends on the size of the data. For this example I have used a split ratio of 0.8 and, based on the seasonality of the data, a window size of 60.
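To make the sliding window concrete, here is a minimal pure-Python sketch applied to the square number series. The function name `make_windows` and the window size of 3 are chosen only for illustration; the tutorial itself uses a window size of 60.

```python
def make_windows(series, window_size):
    """Slide a window over `series`; the value right after each window is its target."""
    features, targets = [], []
    for start in range(len(series) - window_size):
        features.append(series[start:start + window_size])
        targets.append(series[start + window_size])
    return features, targets


squares = [1, 4, 9, 16, 25, 36, 49]
features, targets = make_windows(squares, window_size=3)
for x, y in zip(features, targets):
    print(x, "->", y)
# [1, 4, 9] -> 16
# [4, 9, 16] -> 25
# [9, 16, 25] -> 36
# [16, 25, 36] -> 49
```

A series of length 7 with a window of 3 gives 7 - 3 = 4 feature and target pairs. To predict the number after 49, the trained model would be fed the last window, [25, 36, 49].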

import numpy as np

# Convert the data values to numpy for better and faster processing
time_index = np.array(df['Date'])
data = np.array(df['Monthly Mean Total Sunspot Number'])


# Certain Hyper-parameters to tune
SPLIT_RATIO = 0.8
WINDOW_SIZE = 60
BATCH_SIZE = 32
SHUFFLE_BUFFER = 1000

# Dividing into train-test split
split_index = int(SPLIT_RATIO * data.shape[0])


# Train-Test Split
train_data = data[:split_index]
train_time = time_index[:split_index]

test_data = data[split_index:]
test_time = time_index[split_index:]

Now, as a best practice for Deep Neural Network training, it is advisable to shuffle the training data and to train in mini-batches rather than on one sample at a time. For this reason, we will use a time series data generator routine that creates the required pairs of feature values and target values and does the necessary random shuffling.

import tensorflow as tf

def ts_data_generator(data, window_size, batch_size, shuffle_buffer):
  '''
  Utility function for time series data generation in batches
  '''
  ts_data = tf.data.Dataset.from_tensor_slices(data)
  # Each window holds window_size feature values plus one target value
  ts_data = ts_data.window(window_size + 1, shift=1, drop_remainder=True)
  ts_data = ts_data.flat_map(lambda window: window.batch(window_size + 1))
  # Shuffle the windows, then split each one into (features, target)
  ts_data = ts_data.shuffle(shuffle_buffer).map(lambda window: (window[:-1], window[-1]))
  ts_data = ts_data.batch(batch_size).prefetch(1)
  return ts_data

train_dataset = ts_data_generator(train_data, WINDOW_SIZE, BATCH_SIZE, SHUFFLE_BUFFER)
test_dataset = ts_data_generator(test_data, WINDOW_SIZE, BATCH_SIZE, SHUFFLE_BUFFER)

With these we have the required processed data which is ready for model training.
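To sanity-check what the generator produces, it can help to reproduce the windowing with plain NumPy and look at the shapes. This is only a rough stand-in for the tf.data pipeline, without shuffling or prefetching; the function name `window_pairs` and the toy array of 100 values are assumptions for illustration.

```python
import numpy as np


def window_pairs(series, window_size):
    """NumPy stand-in for the windowing step: returns (features, targets) arrays."""
    windows = np.lib.stride_tricks.sliding_window_view(series, window_size + 1)
    return windows[:, :-1], windows[:, -1]


toy = np.arange(100, dtype=np.float32)   # toy stand-in for train_data
X, y = window_pairs(toy, window_size=60)
print(X.shape, y.shape)                   # (40, 60) (40,)

n_batches = -(-len(X) // 32)              # ceiling division; the last batch may be smaller
print(n_batches)                          # 2
```

A series of N values yields N - window_size pairs, so a short test split combined with a large window leaves relatively few windows to evaluate on.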

Step 3 – Building DNN Model and Training

For this problem, we will use a very simple three-layer neural network built with the TensorFlow Keras API, train it on the training dataset and evaluate it on the test dataset. A small note before we start training: there are many hyper-parameters you need to be aware of, and tuning them can improve the results considerably. They include the number of neural network layers, the number of hidden units in each layer, the activation function, the loss function, the optimizer, the learning rate of the optimizer and the number of training epochs. The code for defining and training the model is as follows:

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(20, input_shape=[WINDOW_SIZE], activation="relu"), 
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1)
])

# learning_rate replaces the deprecated lr argument
model.compile(loss="mse", optimizer=tf.keras.optimizers.SGD(learning_rate=1e-7, momentum=0.9))
model.fit(train_dataset, epochs=200, validation_data=test_dataset)
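To get a feel for how small this network is, we can count its trainable parameters by hand. The sketch below is plain arithmetic and does not depend on TensorFlow; the helper name `dense_params` is made up for this example, and calling `model.summary()` should report the same total.

```python
def dense_params(n_inputs, n_units):
    """A fully connected layer has one weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units


WINDOW_SIZE = 60
layer_sizes = [20, 10, 1]   # units in the three Dense layers of the model

total, n_inputs = 0, WINDOW_SIZE
for units in layer_sizes:
    total += dense_params(n_inputs, units)
    n_inputs = units        # the output of this layer feeds the next one

print(total)  # 1220 + 210 + 11 = 1441
```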

Step 4 – Model Evaluation on Test Data

Once the model training is complete, the next step is to evaluate how well the model performs. One crucial decision in this step is the choice of evaluation metric. I usually prefer metrics like Mean Absolute Percentage Error (MAPE) or Symmetric Mean Absolute Percentage Error (SMAPE), and sometimes I go for Root Mean Square Error (RMSE), but in this case I will use Mean Absolute Error (MAE).
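For reference, the four metrics mentioned above can be sketched in a few lines of plain Python. The toy numbers are invented for illustration and are not model results. SMAPE has several variants in the literature; the one shown here is one common definition.

```python
import math


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    # Undefined whenever an actual value is exactly zero
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def smape(actual, predicted):
    # One common variant: absolute error divided by the mean of |actual| and |predicted|
    return 100 * sum(2 * abs(a - p) / (abs(a) + abs(p))
                     for a, p in zip(actual, predicted)) / len(actual)


actual = [100.0, 200.0, 50.0, 80.0]       # toy values
predicted = [110.0, 190.0, 60.0, 70.0]    # toy values

print(mae(actual, predicted))    # 10.0
print(rmse(actual, predicted))   # 10.0
print(mape(actual, predicted))   # 11.875
print(smape(actual, predicted))
```

Because MAPE divides by the actual value, it breaks down if the series contains zeros, which is one practical reason to prefer MAE for data like this.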

The code for generating the forecast on the test data and calculating the MAE value is as follows:

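forecast=[]
# Slide over the full series; entry i predicts data[i + WINDOW_SIZE]
for time in range(len(data) - WINDOW_SIZE):
  forecast.append(model.predict(data[time:time + WINDOW_SIZE][np.newaxis]))

# Keep only the forecasts that correspond to the test period
forecast = forecast[split_index-WINDOW_SIZE:]
results = np.array(forecast)[:, 0, 0]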


# Overall Error
error = tf.keras.metrics.mean_absolute_error(test_data, results).numpy()
print(error)
# 16.528038
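The slicing by `split_index - WINDOW_SIZE` in the forecast code can look puzzling at first. The sketch below reproduces just the index arithmetic on toy data, with a hypothetical `naive_last_value` function standing in for `model.predict`. It is meant to show why the sliced forecasts line up with the test data, not to be a forecasting method.

```python
def rolling_forecast(series, window_size, predict_fn):
    """One-step-ahead forecasts: entry i is the prediction for series[i + window_size]."""
    return [predict_fn(series[t:t + window_size])
            for t in range(len(series) - window_size)]


def naive_last_value(window):
    # Hypothetical stand-in for model.predict: repeat the last observed value
    return window[-1]


series = list(range(20))            # toy data: 0, 1, ..., 19
window_size, split_index = 5, 16    # toy values, not the tutorial's settings

forecast = rolling_forecast(series, window_size, naive_last_value)
test_forecast = forecast[split_index - window_size:]
test_actual = series[split_index:]

print(len(test_forecast) == len(test_actual))  # True
print(list(zip(test_actual, test_forecast)))   # [(16, 15), (17, 16), (18, 17), (19, 18)]
```

Since entry i of the forecast list targets position i + window_size, the first test position `split_index` corresponds to entry `split_index - window_size`.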

Thus we get a mean absolute error of around 16.5, which is not that bad considering the dataset! But can these results be improved? Absolutely! Do you know how? Maybe you can try the following:

  1. Increasing the training data to 85% or 90% of the total data
  2. Increasing the number of layers or adding more hidden units
  3. Choosing the right window and batch size (ideally a window size that is a factor of the seasonal period of the data)
  4. Experimenting with a different learning rate and loss function

Nevertheless, we do have a good enough time series forecasting model using a Deep Neural Network and TensorFlow! Finally, let’s visualize the results and see how good our model is:

Not bad after all! We see that the predictions almost match the actual values, with a few exceptions around the peaks, and the overall pattern is well aligned!

One final point: since a forecast given as a single point estimate can never be 100% accurate, it is always better to present a time series forecast as a range of forecasted values, shown as an error band or confidence interval around the predicted value.

The code for the visualization part is as follows:

import matplotlib.pyplot as plt

plt.figure(figsize=(15, 6))

plt.plot(list(range(split_index,len(data))), test_data, label = 'Test Data')
plt.plot(list(range(split_index,len(data))), results, label = 'Predictions')
plt.legend()
plt.show()

plt.figure(figsize=(15, 6))
# Plotting with an error band of +/- MAE around the predictions
plt.plot(list(range(split_index,len(data))), results, label = 'Predictions', color = 'k', linestyle = '--')
plt.fill_between(range(split_index,len(data)), results - error, results + error, alpha = 0.5, color = 'red')
plt.legend()
plt.show()

And the final visualization for the forecast with the confidence bands is as follows:

As long as the error band or confidence interval is narrow and the actual values lie within it, we have a good prediction model!
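One way to put a number on "the actual values lie within the band" is to compute the fraction of test points that fall inside it. The sketch below uses invented toy numbers and a hypothetical helper `band_coverage`. Note that a band of plus or minus the MAE, as plotted above, is a simple error band rather than a statistical confidence interval, so its coverage is whatever the data happens to give.

```python
def band_coverage(actual, predicted, half_width):
    """Fraction of actual values inside [prediction - half_width, prediction + half_width]."""
    inside = sum(1 for a, p in zip(actual, predicted) if abs(a - p) <= half_width)
    return inside / len(actual)


# Toy numbers for illustration, not results from the sunspot model
actual = [10.0, 12.0, 9.0, 15.0, 11.0]
predicted = [11.0, 11.0, 10.0, 11.0, 11.0]

# Use the MAE as the half-width of the band, as in the plot above
half_width = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
coverage = band_coverage(actual, predicted, half_width)
print(half_width, coverage)
```

Here four of the five points fall inside the band; the one large miss is the kind of peak error the plots tend to show.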

This brings us to the end of the tutorial on using TensorFlow to solve a time series prediction problem with Deep Learning. The entire code, including the notebook, is available in my GitHub account. Do like, share and comment for more such tutorials and keep following to learn more!
