General Applications of Neural Networks
Session 6: Recurrent Neural Networks and Time Series Analysis

Instructor: Wesley Beckner


In this session, we'll explore neural networks (NNs) as they apply to sequenced data, specifically time series data.

6.0 Preparing Environment and Importing Data

6.0.1 Import Packages

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import random
from scipy.stats import gamma, norm, expon
from ipywidgets import interact
from statsmodels.tsa.stattools import pacf, acf
from sklearn.metrics import mean_squared_error

def melt_results(model, X, y, window_size):
  y_pred = model.predict(X)
  results = pd.DataFrame(y_pred, y)
  results = results.reset_index()
  results.index = orders.loc[0, time_cols].index[window_size:]
  results = results.reset_index()
  results.columns=['Date', 'real', 'predicted']
  results = results.melt(id_vars='Date', var_name='Source', value_name='KG')
  return results

def process_data(Xy, window=3, time_cols=12, remove_null=False):
  """
  This function splits your time series data into the proper windows

  Parameters
  ----------
  Xy: array
    The input data. If there are non-time series columns, assumes they are on
    the left and time columns are on the right.
  time_cols: int
    The number of time columns, default 12
  window: int
    The time window size, default 3
  remove_null: bool
    If True, drop windows whose target is NaN, default False

  Returns
  -------
  X_: array
    The independent variables, includes time and non-time series columns with
    the new window
  y_: array
    The dependent variable, selected from the time columns at the end of the
    window
  labels: list
    The time series labels, can be used in subsequent plot
  """
  # separate the non-time series columns
  X_cat = Xy[:,:-time_cols]

  # select the columns to apply the sweeping window
  X = Xy[:,-time_cols:]

  X_ = []
  y = []

  for i in range(X.shape[1]-window):
    # after attaching the current window to the non-time series 
    # columns, add it to a growing list
    X_.append(np.concatenate((X_cat, X[:, i:i+window]), axis=1))

    # add the next time delta after the window to the list of y
    # values
    y.append(X[:, i+window])

  # X_ is 3D: [number of replicates from sweeping window,
  #           length of input data, 
  #           size of new feature with categories and time]
  # we want to reshape X_ so that the replicates due to the sweeping window is 
  # a part of the same dimension as the instances of the input data
  X_ = np.array(X_).reshape(X.shape[0]*np.array(X_).shape[0],window+X_cat.shape[1])
  y = np.array(y).reshape(X.shape[0]*np.array(y).shape[0],)

  if remove_null:
    # remove training data where the target is missing/NaN (may be unfair advantage)
    X_ = X_[np.where(~np.isnan(y.astype(float)))[0]]
    y = y[np.where(~np.isnan(y.astype(float)))[0]]

  # create labels that show the previous month values used to train the model
  labels = []
  for row in X_:
    labels.append("X: {}".format(np.array2string(row[-window:].astype(float).round())))
  return X_, y, labels

def train_test_process(data, train_test_val_ratios = [0.6, 0.8], window_size=3):
  # get the indices at the associated ratios
  idx_split1 = int(data.shape[1]*train_test_val_ratios[0])
  idx_split2 = int(data.shape[1]*train_test_val_ratios[1])

  # index the data to build the sets
  data_train = data[:,:idx_split1]
  data_val = data[:,idx_split1:idx_split2]
  data_test = data[:,idx_split2:]

  # build out the training sets with the sweeping window method
  X_train, y_train, labels = process_data(data_train, window=window_size, time_cols=132)
  X_val, y_val, labels = process_data(data_val, window=window_size, time_cols=132)
  X_test, y_test, labels = process_data(data_test, window=window_size, time_cols=132)

  print("train size: {}".format(X_train.shape[0]))
  print("val size: {}".format(X_val.shape[0]))
  print("test size: {}".format(X_test.shape[0]), end='\n\n')

  return X_train, y_train, X_val, y_val, X_test, y_test

6.0.2 Load Dataset

orders = pd.read_csv(""\

cat_cols = ['base_cake', 'truffle_type', 'primary_flavor', 'secondary_flavor',
       'color_group', 'customer']

time_cols = [i for i in orders.columns if i not in cat_cols]

# note that our data is 'untidy' if we wanted to tidy the data we would need to
# unpivot or 'melt' our date columns like so:
orders.melt(id_vars=cat_cols, var_name='date', value_name='kg')

# however the data as it is, is useful for our purposes of timeseries prediction 
# today
base_cake truffle_type primary_flavor secondary_flavor color_group customer date kg
0 Cheese Candy Outer Horchata Vanilla Amethyst Perk-a-Cola 1/2010 12570.335165
1 Tiramisu Chocolate Outer Irish Cream Egg Nog Slate Dandy's Candies 1/2010 7922.970436
2 Sponge Chocolate Outer Ginger Ale Apple Slate Dandy's Candies 1/2010 10521.306722
3 Cheese Chocolate Outer Coffee Pear Opal Dandy's Candies 1/2010 4739.122200
4 Chiffon Jelly Filled Butter Toffee Apricot Olive Slugworth 1/2010 2756.891961
... ... ... ... ... ... ... ... ...
13195 Chiffon Chocolate Outer Acai Berry Tangerine Slate Fickelgruber 12/2020 25714.512372
13196 Butter Jelly Filled Plum Peppermint Olive Fickelgruber 12/2020 15043.303525
13197 Chiffon Chocolate Outer Wild Cherry Cream Peppermint Taupe Perk-a-Cola 12/2020 8769.613116
13198 Cheese Candy Outer Mango Mango Rose Dandy's Candies 12/2020 5065.975534
13199 Sponge Chocolate Outer Ginger Ale Passion Fruit Black Fickelgruber 12/2020 9466.712219

13200 rows × 8 columns

6.1 Why We Think in Sequences

There are some problems that are best framed as a sequence in either the input or the output. For example, in our image classification we are performing a mapping of many-to-one: sequence input (the pixels) to a single output (classification). Other examples include:

  • One-to-many: sequence output, e.g. a word (if treated as a single input) used to generate a picture
  • Many-to-many: sequence input and output, e.g. machine translation (like English to Mandarin)
  • Synchronized many-to-many: synced sequence input and output, e.g. video classification

State-of-the-art handling of sequences has occurred in a class of networks called recurrent neural networks.

6.2 Recurrent Neural Networks

Recurrent Neural Networks (RNNs) can be thought of as feed-forward neural networks (FFNNs) with loops added into the architecture. The loops allow the network to retain information, creating a "memory" that can be associated with signals later in the sequence.

We didn't go into much detail about the actual training algorithm of neural networks: back propagation. What we will say here is that this algorithm breaks down with recurrent neural networks because of the looped connections. A trick was created to overcome this, where the looped connections are unrolled, using a copy of the "unhooked" neuron to represent where the loop was initially fed back. This algorithm is called back propagation through time.

Another problem is introduced when training recurrent neural networks: the gradients calculated during back propagation can become very large (exploding gradients) or very small (vanishing gradients). This problem is mitigated in FFNNs by the ReLU activation. In RNNs, a more sophisticated gating mechanism is used, in an architecture we call Long Short-Term Memory Networks.
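To make the exploding and vanishing gradient problem concrete, here is a minimal sketch that is not part of the original session. It assumes a toy scalar recurrence h_t = w * h_{t-1}; once the loop is unrolled, the chain rule multiplies the gradient by the same weight w at every time step. The function name gradient_through_time and the weights 0.5 and 1.5 are illustrative assumptions.

```python
def gradient_through_time(w, n_steps):
    """Gradient of h_T with respect to h_0 for the toy recurrence h_t = w * h_{t-1}.

    Unrolling the loop over n_steps applies the chain rule once per step,
    so the result is w multiplied by itself n_steps times.
    """
    grad = 1.0
    for _ in range(n_steps):
        grad *= w
    return grad

# a recurrent weight below 1 shrinks the gradient toward zero (vanishing)
vanishing = gradient_through_time(0.5, 50)
# a recurrent weight above 1 grows the gradient rapidly (exploding)
exploding = gradient_through_time(1.5, 50)
print(vanishing, exploding)
```

Real RNNs multiply by weight matrices and activation derivatives rather than a single scalar, but the same repeated product is what drives both failure modes.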

LSTM shown in both typical and unfolded format

6.2.1 Long Short-Term Memory Networks

Long Short-Term Memory Networks (LSTMs) are a type of RNN that are trained using back propagation through time and overcome the vanishing/exploding gradient problem. Similar to CNNs, their architecture is composed of blocks, this time memory blocks rather than convolutional blocks. A block is smarter than the classical neuron; it contains gates that manage the block's state and output. The gates are operated by a sigmoid function, which determines whether they are open or closed (triggered or not triggered). There are three types of gates within a memory block:

  • Forget gate: decides what information is discarded
  • Input gate: decides what information updates the memory state
  • Output gate: decides what information to send forward depending on the input and memory state

The weights that configure these gates are learned during training, and their coordination allows each memory block to learn sophisticated relationships in and among sequenced data.

Big takeaway: memory blocks contain trainable parameters that allow the block to learn relationships between sequenced data.
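As an illustration of how the three gates interact, the following is a minimal NumPy sketch of one forward step through a standard LSTM memory block. It is not the session's code and not how Keras lays out its weights internally; the function name lstm_step, the stacking order of the weight rows, and the random weights are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One forward step of a standard LSTM memory block.

    W: (4 * units, n_features), U: (4 * units, units), b: (4 * units,)
    Rows are stacked in the (assumed) order: forget, input, candidate, output.
    """
    units = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0 * units:1 * units])   # forget gate: what to discard from memory
    i = sigmoid(z[1 * units:2 * units])   # input gate: what to write to memory
    g = np.tanh(z[2 * units:3 * units])   # candidate values to write
    o = sigmoid(z[3 * units:4 * units])   # output gate: what to send forward
    c = f * c_prev + i * g                # updated memory state
    h = o * np.tanh(c)                    # block output
    return h, c

rng = np.random.default_rng(0)
units, n_features = 4, 3
W = rng.normal(size=(4 * units, n_features))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

# run the block over a short sequence of 5 time steps
h, c = np.zeros(units), np.zeros(units)
for x in rng.normal(size=(5, n_features)):
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)
```

In training, W, U, and b are the trainable parameters referred to above; here they are random only so the step can be run.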

6.3 Exploratory Data Analysis with Plotly/Pandas

base_cake truffle_type primary_flavor secondary_flavor color_group customer 1/2010 2/2010 3/2010 4/2010 ... 3/2020 4/2020 5/2020 6/2020 7/2020 8/2020 9/2020 10/2020 11/2020 12/2020
0 Cheese Candy Outer Horchata Vanilla Amethyst Perk-a-Cola 12570.335165 11569.168746 13616.812204 11884.370881 ... 21575.521051 18856.178110 20701.250676 19406.448560 22328.687163 19384.824042 21449.154890 19554.405590 21873.104938 19572.860127
1 Tiramisu Chocolate Outer Irish Cream Egg Nog Slate Dandy's Candies 7922.970436 6464.558625 6616.092291 8244.991928 ... 10145.394106 13132.925131 10821.805709 10829.961838 12995.340352 10504.814195 10617.199735 13377.165673 11065.835571 11135.386324
2 Sponge Chocolate Outer Ginger Ale Apple Slate Dandy's Candies 10521.306722 5543.335645 5294.892374 11010.452413 ... 8516.098910 17498.911792 8369.846849 8334.206937 17519.678690 8595.378915 8909.348040 17234.636475 9002.216839 8794.467252
3 Cheese Chocolate Outer Coffee Pear Opal Dandy's Candies 4739.122200 2733.281035 4984.394797 2750.709519 ... 8093.541144 4301.081977 8235.616589 4151.474242 8213.665500 4008.885583 7912.641813 4275.162782 8031.227879 4628.989194
4 Chiffon Jelly Filled Butter Toffee Apricot Olive Slugworth 2756.891961 1739.900797 1791.975108 1533.023665 ... 1864.015449 1800.566323 1625.130275 1908.316219 2696.631511 1859.017636 1690.042699 1764.410866 1909.608709 1711.780317

5 rows × 138 columns

data = pd.DataFrame(orders.loc[0, time_cols])
data = data.reset_index()
data.columns = ['Date', 'KG']
px.scatter(data, x='Date',  y='KG')

fig, ax = plt.subplots(1,1,figsize=(10,10))
pd.plotting.autocorrelation_plot(data['KG'], ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0x7ff3c213eb50>


Normally with time series data, we'd want to try a host of preprocessing techniques and remove the trend (really create two separate analyses, one of the trend and one of the seasonality) but to keep things simple and to showcase the utility of machine learning, we are going to deviate from the stats-like approach and work with our data as is.

For more details on the stats-like models, you can perform a cursory search on ARIMA, ARMA, and SARIMA.
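For reference, removing a trend can be as simple as fitting and subtracting a straight line. The sketch below is an illustration only: it uses a synthetic monthly series (not the orders data), and the trend and seasonality values are made-up assumptions.

```python
import numpy as np

# synthetic monthly series: level + linear trend + yearly seasonality (assumed values)
months = np.arange(132)
series = 10000.0 + 50.0 * months + 1000.0 * np.sin(2 * np.pi * months / 12)

# fit a straight line by least squares and subtract it to remove the trend
slope, intercept = np.polyfit(months, series, deg=1)
detrended = series - (slope * months + intercept)

# what remains oscillates around zero and can be analyzed for seasonality
print(round(slope, 1), detrended[:3])
```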

6.4 Modeling

from tensorflow import keras
from tensorflow.keras import layers

6.4.1 Sweeping (Rolling) Window

We're going to revisit the idea of a sweeping window from our feature engineering discussion. It turns out that, even though we are using a NN, there is still some preprocessing we need to do. In our case, each time delta is one month. We will choose some number of months to include in our feature set, and this choice determines what our overall training data will look like.

Xy = orders.loc[[0], time_cols].values
# separate the non-time series columns
X_cat = Xy[:,:-120]

# select the columns to apply the sweeping window
X = Xy[:,-120:]

With a window size of 3, our X will have 3 features: the prior 3 months leading up to the month we will attempt to forecast.

window_size = 3
X, y, labels = process_data(orders.loc[[0], time_cols].values, window=window_size, time_cols=132)
array([[12570.33516483, 11569.16874623, 13616.81220446],
       [11569.16874623, 13616.81220446, 11884.37088102],
       [13616.81220446, 11884.37088102, 13950.33233441],
       [11884.37088102, 13950.33233441, 12781.15653568],
       [13950.33233441, 12781.15653568, 14256.21002336]])

With a window size of 1, our X data will have a feature size of 1

window_size = 1
X, y, labels = process_data(orders.loc[[0], time_cols].values, window=window_size, time_cols=132)

and so on.
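The sweeping window is easier to see on a tiny series. The sketch below is a simplified stand-in for process_data (a single series, no categorical columns); the helper name sliding_windows and the toy values are assumptions for illustration.

```python
import numpy as np

def sliding_windows(series, window):
    """Split a 1D series into rows of `window` past values and the next value."""
    series = np.asarray(series)
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

X_toy, y_toy = sliding_windows([10, 20, 30, 40, 50, 60], window=3)
print(X_toy)
print(y_toy)
# [[10 20 30]
#  [20 30 40]
#  [30 40 50]]
# [40 50 60]
```

A series of length n yields n - window training rows, which is why the row count shrinks as the window grows.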

6.4.2 FFNN

I'm going to start with a very simple FFNN model:

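model = keras.Sequential([
    layers.Dense(8, input_shape=[window_size]), # one layer, 8 nodes
    layers.Dense(1) # single output for the kg
])

model.compile(loss='mean_squared_error', optimizer='adam')
# NOTE: only X, y and verbose=0 are known for this cell; the other
# EarlyStopping and fit arguments below are assumptions
early_stopping = keras.callbacks.EarlyStopping(
    monitor='loss',  # assumed: no validation data is passed to fit
    patience=10,     # assumed value
)
history = model.fit(
    X, y,
    epochs=10000,    # assumed upper bound; early stopping ends training sooner
    callbacks=[early_stopping],
    verbose=0, # hide the output because we have so many epochs
)
history_df = pd.DataFrame(history.history)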
19 4.243438e+06
20 4.240187e+06
21 4.240501e+06
22 4.247284e+06
23 4.238702e+06

As we can see from the y vs y_pred the FFNN is just predicting the previous month's value:

y_pred = model.predict(X)
pd.DataFrame(y_pred, y)
11569.168746 12562.842773
13616.812204 11562.307617
11884.370881 13608.662109
13950.332334 11877.311523
12781.156536 13941.969727
... ...
19384.824042 22315.048828
21449.154890 19373.041016
19554.405590 21436.068359
21873.104938 19542.517578
19572.860127 21859.757812

131 rows × 1 columns
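A model that repeats the previous month's value is known as a naive or persistence forecast, and its error makes a useful baseline for anything more elaborate. The following is a small sketch on toy numbers (not the orders data); the helper name persistence_mse is an assumption.

```python
import numpy as np

def persistence_mse(series):
    """MSE of the naive forecast that predicts each value with the previous one."""
    series = np.asarray(series, dtype=float)
    y_true = series[1:]     # values to forecast
    y_naive = series[:-1]   # previous value used as the forecast
    return float(np.mean((y_true - y_naive) ** 2))

toy = [100.0, 120.0, 110.0, 130.0]
baseline = persistence_mse(toy)
print(baseline)  # (20**2 + 10**2 + 20**2) / 3 = 300.0
```

If a trained network cannot beat this number on the same series, it has not learned more than "repeat last month".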

We can try this with a more suitable window size

window_size = 3
X, y, labels = process_data(orders.loc[[0], time_cols].values, window=window_size, time_cols=132)

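model = keras.Sequential([
    # layers.Dense(8, input_shape=[window_size]),
    layers.Dense(1, input_shape=[window_size])
])

model.compile(loss='mean_squared_error', optimizer='adam')
history = model.fit(
    X, y,
    epochs=10000,  # assumed upper bound (the original value is not shown)
    callbacks=[early_stopping],  # assumed: reuses the callback defined earlier
    verbose=0, # hide the output because we have so many epochs
)
history_df = pd.DataFrame(history.history)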
752 550135.6875
753 555000.3750
754 550800.3125
755 551368.6250
756 548760.4375

At a cursory glance, our predicted and actual values look closer together:

results = melt_results(model, X, y, window_size)
px.line(results, x='Date', y='KG', color='Source')

🏋️ Exercise-Discussion 1: Verify that the model is linear

We're having to change our way of thinking here with time series analysis. Recall that a model without an activation function can only encapsulate linear relationships. How come we can see non-linear relationships in our time series plot above? Make a plot that shows we are indeed still within a linear world.

This is an open ended question, think about how you would attempt to show linearity of the model. (In Lab 1 our model predicted on only 2 dimensions (vs 3, in this case), and it was a binary classification task, so it was easier to view the decision boundaries and verify linearity).
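One possible starting point, sketched here under assumptions rather than offered as the intended answer: a model with no activation function is affine, so its prediction at any point on the straight line between two inputs must equal the same weighted mix of the predictions at the two endpoints. The stand-in functions affine_predict and relu_predict and their weights are hypothetical; with a trained Keras model you would pass a wrapper around model.predict instead.

```python
import numpy as np

# hypothetical stand-in weights for a 3-input, single-output model
w, b = np.array([1.0, -2.0, 0.5]), 0.5

def affine_predict(X):
    # what a Dense(1) layer with no activation computes
    return X @ w + b

def relu_predict(X):
    # the same layer followed by a ReLU, for contrast
    return np.maximum(X @ w + b, 0.0)

def is_affine_between(predict, x1, x2, n_points=11):
    """Check predict(a*x1 + (1-a)*x2) == a*predict(x1) + (1-a)*predict(x2)."""
    for a in np.linspace(0.0, 1.0, n_points):
        mix = a * x1 + (1 - a) * x2
        lhs = predict(mix[None, :])[0]
        rhs = a * predict(x1[None, :])[0] + (1 - a) * predict(x2[None, :])[0]
        if not np.isclose(lhs, rhs):
            return False
    return True

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 0.0, 5.0])
print(is_affine_between(affine_predict, x1, x2))  # True
print(is_affine_between(relu_predict, x1, x2))    # False
```

Plotting the left-hand side against the right-hand side for many input pairs gives a visual version of the same check: an affine model falls exactly on the diagonal.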

# Code cell for Exercise 1

🏋️ Exercise 2: Vary model architecture and window size

Create these three different models. Train on the whole dataset with a window size of 3. Record the training loss for the last 5 epochs of each model.

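# each first layer is assumed to feed a single-unit output layer, as in 6.4.2
models = [
    keras.Sequential([layers.Dense(8, input_shape=[window_size]), layers.Dense(1)]),
    keras.Sequential([layers.Dense(8, activation='relu', input_shape=[window_size]), layers.Dense(1)]),
    keras.Sequential([layers.Dense(4, activation='relu', input_shape=[window_size]), layers.Dense(1)]),
]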

You can create the training sets with:

window_size = 3
X, y, labels = process_data(orders.loc[[0], time_cols].values, window=window_size, time_cols=132)

Use a batch size of 10 when training.

When you are finished training a model use melt_results and plotly to make a graph of your predictions vs actuals

df = melt_results(model, X, y, window_size)
px.line(df, x='Date', y='KG', color='Source')

You can use the same early_stopping and fit formula from 6.4.2

# Code cell for exercise 2
window_size = 3
batch_size = 10

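# each first layer is assumed to feed a single-unit output layer, as in 6.4.2
models = [
    keras.Sequential([layers.Dense(8, input_shape=[window_size]), layers.Dense(1)]),
    keras.Sequential([layers.Dense(8, activation='relu', input_shape=[window_size]), layers.Dense(1)]),
    keras.Sequential([layers.Dense(4, activation='relu', input_shape=[window_size]), layers.Dense(1)]),
]

X, y, labels = process_data(orders.loc[[0], time_cols].values, window=window_size, time_cols=132)

dfs = []
for model in models:
  model.compile(loss='mean_squared_error', optimizer='adam')
  history = model.fit(
      X, y,
      batch_size=batch_size,
      epochs=10000,  # assumed upper bound; early stopping ends training sooner
      callbacks=[early_stopping],  # the callback from 6.4.2
      verbose=0, # hide the output because we have so many epochs
  )
  # print the last 10 training losses for this model
  print(history.history['loss'][-10:])
  df = melt_results(model, X, y, window_size)
  dfs.append(df)
  px.line(df, x='Date', y='KG', color='Source')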
[449815.4375, 438517.75, 434076.40625, 436526.9375, 431749.71875, 432751.5, 431072.125, 433903.71875, 434614.8125, 434704.71875]
[319843872.0, 319843424.0, 319842976.0, 319842528.0, 319842080.0, 319841632.0, 319841120.0, 319840704.0, 319840224.0, 319839776.0]
[398923.6875, 398456.15625, 399380.0, 399915.5, 406269.09375, 400187.28125, 397825.96875, 412889.375, 399718.75, 402859.40625]
px.line(dfs[2], x='Date', y='KG', color='Source')

6.4.3 LSTM NN

Our data preparation for the LSTM NN includes time steps: the input is reshaped to three dimensions (samples, time steps, features). The parameter input_dim tells our LSTM block how many features arrive at each time step. This is a reframing (and a more appropriate reframing) of the same problem. An LSTM can view a windowed input as a single feature observed at different times, rather than as separate features; note that the reshape used below, reshape(-1, 1, window_size), instead presents each window as one time step with window_size features. We could, for instance, have a second dimension that includes non-time related information, such as the customer name or truffle types (or other features that also vary through time, such as multiple feed rates or T/P).
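The two ways of laying out a windowed sample for a recurrent layer differ only in the reshape. The sketch below uses a toy array (not the orders data) to show both layouts, following the (samples, time steps, features) convention; the variable names are illustrative assumptions.

```python
import numpy as np

window_size = 3
# 4 toy samples, each holding 3 lagged values
X_flat = np.arange(12, dtype=float).reshape(4, window_size)

# one time step that carries window_size features
X_one_step = X_flat.reshape(-1, 1, window_size)

# alternative: window_size time steps with a single feature each
X_many_steps = X_flat.reshape(-1, window_size, 1)

print(X_one_step.shape)    # (4, 1, 3)
print(X_many_steps.shape)  # (4, 3, 1)
```

With the second layout the recurrent loop is actually exercised once per lagged month, so the layer's feature dimension would be 1 rather than window_size.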

window_size = 6
batch_size = 10

X, y, labels = process_data(orders.loc[[0], time_cols].values, window=window_size, time_cols=132)

X = X.reshape(-1, 1, window_size)
y = y.reshape(-1, 1, 1)

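model = keras.Sequential([
    layers.LSTM(8, activation='relu', input_dim=window_size),
    layers.Dense(1)  # assumed single-unit output layer
])

model.compile(loss='mean_squared_error', optimizer='adam')
history = model.fit(
    X, y,
    batch_size=batch_size,
    epochs=10000,  # assumed upper bound (the original value is not shown)
    callbacks=[early_stopping],  # assumed: reuses the callback from 6.4.2
    verbose=0, # hide the output because we have so many epochs
)
history_df = pd.DataFrame(history.history)
WARNING:tensorflow:Layer lstm will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.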
4050 4906831.5
4051 4906659.0
4052 4906790.5
4053 4906763.0
4054 4906620.5
results = melt_results(model, X, y.flatten(), window_size)
px.line(results, x='Date', y='KG', color='Source')

🏋️ Exercise 3: Compare LSTM with FFNN using Train/Val/Test sets and 3 Month Window

window_size = 3
batch_size = 10
patience = 50

# training on single order history
data = orders.loc[[0], time_cols].values

# describes the split train 0-.6/val .6-.8/test .8-1
train_test_val_ratios = [0.6, 0.8]

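X_train, y_train, X_val, y_val, X_test, y_test = train_test_process(data,
                                                        train_test_val_ratios,
                                                        window_size)

early_stopping = keras.callbacks.EarlyStopping(
    patience=patience,  # patience is defined above; any other arguments are unknown
)

model = keras.Sequential([
    layers.Dense(4, activation='relu', input_shape=[window_size]),
    layers.Dense(1)  # assumed single-unit output layer
])

# compile the model
model.compile(loss='mean_squared_error', optimizer='adam')

# fit the model
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=batch_size,
    epochs=10000,  # assumed upper bound; early stopping ends training sooner
    callbacks=[early_stopping],
    verbose=0, # hide the output because we have so many epochs
)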

train size: 76
val size: 23
test size: 24

             loss     val_loss
796  274950.81250  705562.1875
797  278971.28125  726105.4375
798  283036.78125  696195.4375
799  284960.03125  723620.4375
800  279106.31250  691688.0000

We'll then record the test MSE of the model to compare later with the LSTM:

results = []
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
results.append(['Dense', mse])
[['Dense', 444380.32838419516]]

We'll use the same parameters (window size, batch size, and early stopping) to train the LSTM, and compare it with the optimum FFNN architecture we used previously.

X_train = X_train.reshape(-1, 1, window_size)
y_train = y_train.reshape(-1, 1, 1)
X_val = X_val.reshape(-1, 1, window_size)
y_val = y_val.reshape(-1, 1, 1)
X_test = X_test.reshape(-1, 1, window_size)
y_test = y_test.reshape(-1, 1, 1)

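model = keras.Sequential([
    layers.LSTM(8, activation='relu', input_dim=window_size),
    layers.Dense(1)  # assumed single-unit output layer
])

model.compile(loss='mean_squared_error', optimizer='adam')

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=batch_size,
    epochs=10000,  # assumed upper bound; early stopping ends training sooner
    callbacks=[early_stopping],
)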

WARNING:tensorflow:Layer lstm_1 will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
            loss     val_loss
836  273390.4375  673023.0625
837  271066.7500  694139.5625
838  275661.9375  705827.1875
839  274106.7500  680028.2500
840  270606.5000  691417.3750
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test.flatten(), y_pred.flatten())
results.append(['LSTM', mse])

Comparison of results:

pd.DataFrame(results, columns=['Model', 'Test MSE']).set_index('Model').astype(int)
Test MSE
Dense 444380
LSTM 424835

As a last visualization in this exercise, we'll look at the train/val/test predictions alongside the actual values:

data = orders.loc[[0], time_cols].values
idx_split1 = int(data.shape[1]*train_test_val_ratios[0])
idx_split2 = int(data.shape[1]*train_test_val_ratios[1])

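# flatten the (n, 1) predictions so they can be assigned to a single column
y_p_train = model.predict(X_train).flatten()
y_p_val = model.predict(X_val).flatten()
y_p_test = model.predict(X_test).flatten()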
new = orders.loc[[0], time_cols].T.reset_index()
new.columns = ['Date', 'Real']
new['Train'] = np.nan
new.iloc[window_size:idx_split1,2] = y_p_train
new['Val'] = np.nan
new.iloc[idx_split1+window_size:idx_split2,3] = y_p_val
new['Test'] = np.nan
new.iloc[idx_split2+window_size:,4] = y_p_test
new = new.melt(id_vars='Date', var_name='Source', value_name='KG')
px.line(new, x='Date', y='KG', color='Source')

6.5 Model Extensibility

from ipywidgets import interact

🏋️ Exercise 4: Apply Model to Other Orders

Take the last LSTM model and apply it to other orders in the dataset. What do you notice?

def apply_model(dataset=orders.index, window_size=3):
  window_size = window_size
  data = pd.DataFrame(orders.loc[dataset, time_cols])
  data = data.reset_index()
  data.columns = ['Date', 'KG']

  X, y, labels = process_data(orders.loc[[dataset], time_cols].values,
                              window=window_size, time_cols=132)

  y_pred = model.predict(X.reshape(-1, 1, window_size)).flatten()

  results = pd.DataFrame(y_pred,  y)
  results = results.reset_index()
  results.index = data['Date'][window_size:]
  results = results.reset_index()
  results.columns=['Date', 'real', 'predicted']
  results = results.melt(id_vars='Date', var_name='Source', value_name='KG')

  fig = px.line(results, x='Date', y='KG', color='Source')
  return fig
interactive(children=(Dropdown(description='dataset', options=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1…

<function __main__.apply_model>

🏋️ Exercise-Discussion 5.1: How Would You Create a General Forecast Model?

After exploring how your model does on other order histories, what do you think is a good strategy for developing company-wide order forecasts?

Some possible questions:

  • Should you create a single model for the whole company?
  • Could you embed metadata about the order in this all-inclusive model?
  • Should you make models specific to certain customers, products, etc.?
  • What kind of analysis could you do beforehand to determine how your models should be grouped?
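On the question of embedding metadata, one common option is to one-hot encode a categorical column and place it to the left of the windowed values, which matches the column order process_data expects. The sketch below is only an illustration with toy values; the arrays customers and windows are made up for the example.

```python
import numpy as np

# toy metadata and toy 3-month windows, one row per training sample (assumed values)
customers = np.array(['Slugworth', 'Fickelgruber', 'Slugworth'])
windows = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0],
                    [7.0, 8.0, 9.0]])

# map each customer to an integer code, then to a one-hot row
categories, codes = np.unique(customers, return_inverse=True)
one_hot = np.eye(len(categories))[codes]

# non-time series columns on the left, time columns on the right
X_with_meta = np.concatenate([one_hot, windows], axis=1)
print(categories)
print(X_with_meta)
```

The input layer of the model would then need window_size plus the number of one-hot columns as its input size.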
melted = orders.melt(id_vars=['base_cake', 'truffle_type', 'primary_flavor', 'secondary_flavor',
       'color_group', 'customer'], var_name='month', value_name='kg')
def my_eda(color=cat_cols):
  fig = px.line(melted, x='month', y='kg', color=color)
  return fig
interactive(children=(Dropdown(description='color', options=('base_cake', 'truffle_type', 'primary_flavor', 's…

<function __main__.my_eda>

🏋️ Exercise 5.2: EDA

In our quest to create a model that works well for all orders to truffltopia, I'll tell you that some orders show patterned behavior according to their metadata. Your first task is to find out which categorical variable best separates the data. You can use any statistical or visual method you like.

# recall the categorical variables:
['base_cake', 'truffle_type', 'primary_flavor', 'secondary_flavor', 'color_group', 'customer']

From C1 S6, it may be useful to think of this diagram:

🏋️ Exercise 5.3: Decide on Model

Will you model the whole dataset together? Will you create a number of submodels? Choose based on the groupings you determined to be statistically significant in the data.

As a base comparison I have provided a formula that trains a model on the entire order history:

data = orders
data = data[time_cols].values

batch_size = 256
window_size = 12

print("batch size: {}".format(batch_size))
print("window size: {}".format(window_size), end='\n\n')

# describes the split train 0-.8/val .8-.9/test .9-1
train_test_val_ratios = [0.8, 0.9]

X_train, y_train, X_val, y_val, X_test, y_test = train_test_process(data,
                                                        train_test_val_ratios,
                                                        window_size)

early_stopping = keras.callbacks.EarlyStopping(
    patience=50,  # assumed value; the original arguments are not shown
)

model = keras.Sequential([
    layers.Dense(8, input_shape=[window_size]),
    layers.Dense(1)  # assumed single-unit output layer
])

model.compile(loss='mean_squared_error', optimizer='adam')

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=batch_size,
    epochs=10000,  # assumed upper bound; early stopping ends training sooner
    callbacks=[early_stopping],
    verbose=0, # hide the output because we have so many epochs
)

batch size: 256
window size: 12

train size: 9300
val size: 100
test size: 200

             loss      val_loss
178  273904.93750  533661.50000
179  288718.50000  519464.28125
180  295474.71875  513898.46875
181  299524.78125  664799.06250
182  283324.56250  509953.53125

And a history of the loss with the following settings:

batch size: 256
window size: 12

train size: 9300
val size: 100
test size: 200

             loss     val_loss
326  279111.15625  953265.0625
327  322529.15625  580780.2500
328  285901.56250  476007.4375
329  302237.68750  496192.8125
330  281779.40625  480916.6250