
Open In Colab

General Applications of Neural Networks
Session 1: The Multilayer Perceptron

Instructor: Wesley Beckner

Contact: wesleybeckner@gmail.com



In this session we will introduce Neural Networks! We'll cover the overarching concepts used to talk about network architecture as well as their building blocks.

Images in this notebook are borrowed from Ryan Holbrook




1.0 Preparing Environment and Importing Data

back to top

!pip uninstall scikit-learn -y

!pip install -U scikit-learn
Found existing installation: scikit-learn 0.24.2
Uninstalling scikit-learn-0.24.2:
  Successfully uninstalled scikit-learn-0.24.2
Collecting scikit-learn
  Using cached scikit_learn-0.24.2-cp37-cp37m-manylinux2010_x86_64.whl (22.3 MB)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (2.2.0)
Requirement already satisfied: scipy>=0.19.1 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.4.1)
Requirement already satisfied: numpy>=1.13.3 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.19.5)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.0.1)
Installing collected packages: scikit-learn
Successfully installed scikit-learn-0.24.2

1.0.1 Import Packages

back to top

from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras import layers
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn.model_selection import train_test_split
import plotly.express as px

from sklearn.impute import SimpleImputer
from copy import copy
sns.set()

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn import set_config
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
set_config(display='diagram')

1.0.2 Load Dataset

back to top

Before diving in, we'll read in the wine quality dataset and set up a preprocessing pipeline for it:

# import wine data
wine = pd.read_csv("https://raw.githubusercontent.com/wesleybeckner/"\
      "ds_for_engineers/main/data/wine_quality/winequalityN.csv")

# create X and y
X = wine.copy()
y = X.pop('quality')

# split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train)

# the numerical values pipe
num_proc = make_pipeline(SimpleImputer(strategy='median'), # impute with median
                         StandardScaler()) # scale and center

# the categorical values pipe
cat_proc = make_pipeline(
    SimpleImputer(strategy='constant', 
                  fill_value='missing'), # impute with placeholder
    OneHotEncoder(handle_unknown='ignore')) # one hot encode

# parallelize the two pipes
preprocessor = make_column_transformer((num_proc,
                                make_column_selector(dtype_include=np.number)),
                                       (cat_proc,
                                make_column_selector(dtype_include=object)))

X_train_std = preprocessor.fit_transform(X_train) # fit_transform on train
X_test_std = preprocessor.transform(X_test) # transform test and validation
X_val_std = preprocessor.transform(X_val)

y_train_std = np.log(y_train) # log output y
y_val_std = np.log(y_val) # log output y
y_test_std = np.log(y_test) # log output y

preprocessor
ColumnTransformer(transformers=[('pipeline-1',
                                 Pipeline(steps=[('simpleimputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('standardscaler',
                                                  StandardScaler())]),
                                 ),
                                ('pipeline-2',
                                 Pipeline(steps=[('simpleimputer',
                                                  SimpleImputer(fill_value='missing',
                                                                strategy='constant')),
                                                 ('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 )])
SimpleImputer(strategy='median')
StandardScaler()
SimpleImputer(fill_value='missing', strategy='constant')
OneHotEncoder(handle_unknown='ignore')

1.1 Neural Network Building Blocks

back to top

1.1.1 The Perceptron

back to top

The simplest unit of a neural network is the perceptron. Given an input vector \(x\) and an output vector \(y\), we can illustrate this like so:

where \(w\) is a weight applied to \(x\) and \(b\) is an unweighted term that we call the bias. We include a bias so that the perceptron is not entirely dependent on the input data. A neural network learns by updating \(w\) and \(b\) so that it can accurately map \(x\) to \(y\). When we write out the perceptron mathematically, we get the following:

\(y = wx + b\)

which should look familiar! This is the equation of a linear function. In fact, we will see that a neural network is essentially many instances of linear regression running alongside, and feeding into, one another.

Often, we will have not an input feature vector \(x\) but an input feature matrix, \(X\). We can update our schematic for a perceptron to account for this:

We can write the mathematical formula for this neuron as follows:

\(y = w_0 x_0 + w_1 x_1 + w_2 x_2 + b\)

In tensorflow/keras we can define this perceptron:

from tensorflow import keras
from tensorflow.keras import layers

# Create a network with 1 linear unit
model = keras.Sequential([
    layers.Dense(units=1,  # number of units (the + filled circle above)
                 input_shape=[3]) # number of x_ (the x filled circle above)
])

In order to build this single perceptron with keras, I had to use some additional language: layers, Dense, Sequential. We'll explain what these refer to in a moment. What I want to draw your attention to now, however, is that we tell layers.Dense that we want 1 unit (the single perceptron) and input_shape=[3] (the number of features). Notice that \(b\) is automatically included without being passed as a parameter, just as we always have a y-intercept in a linear model.
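To see this concretely, a quick sketch: inspecting the weights that keras created for the 3-feature model defined above (the printed shapes below assume that model).

# keras creates one kernel entry per input feature plus an
# automatically-added bias term
for v in model.weights:
    print(v.name, v.shape)  # expect a kernel of shape (3, 1) and a bias of shape (1,)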

After we introduce the other aspects of the neural network architecture, we will train a single-perceptron model and compare it with a linear model; we will see that they are functionally no different.

🏋️ Exercise 1: Single Perceptron

Define a single perceptron that could be used to predict wine density from acidity.

Inspect the weights.

Use the untrained model to predict y and plot it against the true y.

# Code cell for exercise 1

# DECLARE MODEL
model = keras.Sequential([

    ### YOUR CODE HERE ###    

])
model.weights
[<tf.Variable 'dense_2/kernel:0' shape=(1, 1) dtype=float32, numpy=array([[1.4809548]], dtype=float32)>,
 <tf.Variable 'dense_2/bias:0' shape=(1,) dtype=float32, numpy=array([0.], dtype=float32)>]

And now use the untrained model to predict wine['density'] from wine['fixed acidity']

# in the line below, use model.predict() and provide wine['fixed acidity'] as 
# the input data to predict what wine['density'] will be
# y_pred = 

plt.plot(y_pred, wine['density'], ls='', marker='o', alpha=0.3)
[<matplotlib.lines.Line2D at 0x7f5a9269a890>]

png

1.2 Neural Network Architectures

back to top

1.2.1 Neural Network Layers

back to top

Now that we have the most basic building block of a neural network, we can start to discuss the larger architecture. The reason we focused on the lowest building block is that neural networks are modular: they are made up of many instances of these perceptrons, or neurons. Neurons in parallel make up a layer.

These layers feed into one another. When each node of a preceding layer is connected to every node of the following layer, we say they are fully connected and the receiving layer is a dense layer. In a moment we will talk about input, output, and hidden layers for neural networks with three or more layers.
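As a minimal sketch (the layer sizes here are made up purely for illustration), a dense layer in keras is just several units in parallel, all receiving the same inputs:

from tensorflow.keras import layers

# a dense (fully connected) layer of 4 parallel units, each of which
# receives all 2 input features and has its own weights and bias
dense_layer = layers.Dense(units=4, input_shape=[2])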

1.2.2 The Activation Function

back to top

It turns out that stringing together a bunch of linear functions will still result in an overall linear relationship. We need a way to break out of this. A neat trick is introduced at the output of each neuron: the output passes through an activation function. There are a handful of different activation functions used in practice; the most common is known as the rectifier function, \(\max(0, z)\). Applied to our perceptron, the node's output becomes:

\(y = \max(0, wx + b)\)

and the resulting node can be schematically drawn like this:

with the inset at the summation node indicating that the resulting \(y\) value is at minimum 0.
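As a quick sanity check (the sample values are arbitrary), we can see that the rectifier simply clips negative values to zero:

import numpy as np
import tensorflow as tf

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # arbitrary pre-activation values
print(tf.nn.relu(z).numpy())               # [0.  0.  0.  0.5 2. ]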

🏋️ Exercise 2: The Rectifier Function

Write a function called my_perceptron that takes x, a length-2 array, as input. Have your function return the maximum of \(0\) and \(w \cdot x + b\), where w is a length-2 weight vector.

# Code cell for exercise 2
def my_perceptron(x):
  """
  a simple 2 input feature perceptron with predefined weights, intercept, and 
  a rectifier activation function

  Parameters
  ----------
  x: array
    the input array of length 2

  Returns
  -------
  rect: int
    the rectified output of the perceptron
  """

  # # define b, w, and y (y=mx+b)
  # w = 
  # b = 
  # y = 

  # # return the max of 0 and y
  # rect = 
  return rect

After you write your function make sure it returns 0 when the output of the linear component is negative.

def test_zero_output():
  x = np.array([-10,-10]) # a strongly negative input, so w*x + b should come out negative
  assert my_perceptron(x) == 0, "The output is not zero when it should be!"
test_zero_output()
print("test passing")
test passing

1.2.3 Stacking Layers

back to top

When we stack many layers together, we create what is traditionally regarded as a neural network. The first and last layers are called the input and output layers, while the in-between layers are referred to as hidden layers, since their outputs are not directly seen. Traditionally, a neural network with three or more hidden layers is referred to as a deep neural network.

Notice that in this schematic, the last node does not have an activation function. This is typical of a regression task. In a classification task, we might require an activation function here.
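As a brief aside (this sketch is not part of the wine model, and the sigmoid choice assumes a binary classification task), the difference amounts to whether the output layer gets an activation:

from tensorflow.keras import layers

# regression: a linear output with no activation, so the network can
# predict any real-valued number
regression_output = layers.Dense(units=1)

# binary classification (as an example): a sigmoid squashes the output
# into (0, 1) so it can be read as a probability
classification_output = layers.Dense(units=1, activation='sigmoid')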

1.2.4 Building Sequential Neural Networks

back to top

Now that we have the essential components of a neural network architecture, we can enter the domain of overall naming conventions for architecture types. The classic neural network architecture is the feed-forward neural network, where every preceding layer feeds into the next layer. We will practice building one with keras.
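For orientation before the exercise, here is a minimal feed-forward sketch; the layer widths and input size are arbitrary and intentionally different from the exercise below:

from tensorflow import keras
from tensorflow.keras import layers

# a small feed-forward network: each layer feeds into the next
toy_model = keras.Sequential([
    layers.Dense(units=8, activation='relu', input_shape=[3]),  # hidden layer
    layers.Dense(units=8, activation='relu'),                   # hidden layer
    layers.Dense(units=1),                                      # linear output layer
])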

🏋️ Exercise 3: Building Sequential Layers

In the cell below, use keras to build a 3-layer network with activation='relu' and 512 units. Create the output layer so that it can predict 1 continuous value.

# Code cell for exercise 3

# DECLARE THE MODEL

model = keras.Sequential([

    ### YOUR CODE HERE ###

    # the hidden ReLU layers

    # the linear output layer 

])

🏋️ Exercise 4: Other Activation Functions

There are other activation functions we can use after the summation in a neural node. Use the code below to plot and inspect them!

Pick one and do a quick Google search to find out what that activation function's best use case is.

# Code cell for exercise 4

import tensorflow as tf

# YOUR CODE HERE: Change 'relu' to 'elu', 'selu', 'swish'... or something else
activation_layer = layers.Activation('relu')

x = tf.linspace(-3.0, 3.0, 100)
y = activation_layer(x) # once created, a layer is callable just like a function

plt.plot(x, y)
plt.xlim(-3, 3)
plt.xlabel("Input")
plt.ylabel("Output")
plt.show()

png

1.3 Neural Network Training

back to top

We've defined neural network architectures; now how do we train them? There are two main concepts here: the loss function, which we've encountered before, and the optimizer, the means by which we improve the loss function.

1.3.1 The Loss Function

back to top

In previous sessions, we've encountered the mean squared error (MSE):

\(MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\)

Another common loss for neural networks is the mean absolute error (MAE):

\(MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|\)

In any case, the loss function describes the difference between the actual and predicted output of the model. The important thing to note is that the weights in the neural network are systematically updated according to this loss function; they do this via an optimization algorithm.
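As a tiny worked example (the numbers are made up), both losses are straightforward to compute with numpy:

import numpy as np

y_true = np.array([3.0, 5.0, 6.0])  # hypothetical targets
y_pred = np.array([2.5, 5.0, 7.0])  # hypothetical predictions

mse = np.mean((y_true - y_pred) ** 2)   # (0.25 + 0 + 1) / 3 ≈ 0.417
mae = np.mean(np.abs(y_true - y_pred))  # (0.5 + 0 + 1) / 3 = 0.5
print(mse, mae)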

1.3.2 The Optimizer

back to top

In order to update the neural network weights to improve the loss function, we require an algorithm. Virtually all available algorithms for this purpose fall within the family of stochastic gradient descent. This works essentially in these iterative steps:

  1. a subsample of the input data is passed through the network
  2. a loss is computed
  3. the weights are adjusted in a direction to improve the loss

The key here is in step 3. The brilliance of neural networks is that the loss function is differentiable with respect to the weights in the network, and so the change in loss can be ascribed to particular weight changes. We refer to this as assigning blame in the network, and it works through the mathematical chain rule of differentiation. We won't go into great detail here, other than to make a nod to it and to note that this algorithm (step 3) is referred to as back propagation.

The three-step process is repeated until a stopping criterion is reached, the simplest being that the loss stops improving by more than some threshold, or that a desired loss is achieved.
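To make the three steps concrete, here is a toy sketch of the loop using TensorFlow's GradientTape; it fits a single weight and bias on a made-up batch. This is not how keras implements training internally, but the idea is the same.

import tensorflow as tf

# made-up data generated from y = 2x + 1
x = tf.constant([[1.0], [2.0], [3.0]])
y = tf.constant([[3.0], [5.0], [7.0]])

w = tf.Variable(0.0)  # weight
b = tf.Variable(0.0)  # bias
learning_rate = 0.1

for step in range(100):
    with tf.GradientTape() as tape:
        y_pred = w * x + b                        # 1. pass the (mini)batch through the model
        loss = tf.reduce_mean((y - y_pred) ** 2)  # 2. compute the loss
    dw, db = tape.gradient(loss, [w, b])          # 3. back propagation: gradients of the loss...
    w.assign_sub(learning_rate * dw)              #    ...then step the weights against them
    b.assign_sub(learning_rate * db)

print(w.numpy(), b.numpy())  # approaches w = 2, b = 1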

In the above animation, the black line represents the output of the model, the red dots make up a minibatch (or simply a batch), and the opaque red dots represent the whole training dataset. Exposing the model to an entire round of the training data is referred to as an epoch. The training loss improves with additional rounds of training (middle panel), and the weights are adjusted to update the model (right panel).

1.3.3 Batches and Epochs

An epoch is one complete pass of the model over the entire training set.

A batch is the number of training samples the model processes before computing a total error and updating its internal parameters.

Varying the batch size (from a single sample all the way up to the entire training set) leads to different categorizations of the optimization algorithm:

  • Batch Gradient Descent. Batch Size = Size of Training Set
  • Stochastic Gradient Descent. Batch Size = 1
  • Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set

We will visit additional details about batches and epochs in the next session when we discuss model evaluation.
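In keras, these three flavors all come down to the batch_size argument of model.fit. A minimal sketch, using the arrays prepared earlier (the fit calls are left commented out so nothing retrains here):

n_train = X_train_std.shape[0]  # size of the training set

# batch gradient descent: one weight update per epoch
# model.fit(X_train_std, y_train_std, batch_size=n_train, epochs=10)

# stochastic gradient descent: one weight update per training sample
# model.fit(X_train_std, y_train_std, batch_size=1, epochs=10)

# mini-batch gradient descent: the usual middle ground
# model.fit(X_train_std, y_train_std, batch_size=256, epochs=10)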

1.3.4 Learning Rate

back to top

Notice how in the above animation the model makes progressive steps toward a global optimum. The size of these steps is determined by the learning rate. You can think of it as the amount of improvement to make in the direction of steepest descent (the derivative of the loss function with respect to the weights). A large step size can result in stepping over crevices in the solution surface and getting stuck, while too small a step size can lead to a slow algorithm. Often the optimal learning rate is not obvious; luckily, there are some optimizers that are self-calibrating in this regard. Adam is one such optimizer available to us in keras.

# we can compile the model like so
model.compile(
    optimizer="adam",
    loss="mae",
)
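If we want explicit control over the step size, we can pass an optimizer object instead of a string; 0.001 below is keras' default learning rate for Adam.

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),  # same default, but now tunable
    loss="mae",
)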

🏋️ Exercise 5.1: Train your first Neural Networks

back to top

We're going to train our first neural network.

Take the model you created in exercise 3 and paste it in the cell below. Make sure that the input_shape of the first layer matches the number of features in X_train_std

X_train_std.shape[1]
13
# Code cell for exercise 5

# DECLARE THE MODEL

model = keras.Sequential([

    ### YOUR CODE HERE ###

    # the hidden ReLU layers

    # the linear output layer 

])

Now we'll compile the model

model.compile(
    optimizer='adam',
    loss='mse',
)

And then train for 10 epochs

history = model.fit(
    X_train_std, y_train_std,
    validation_data=(X_val_std, y_val_std),
    batch_size=256,
    epochs=10,
)
Epoch 1/10
15/15 [==============================] - 1s 33ms/step - loss: 0.6389 - val_loss: 0.1925
Epoch 2/10
15/15 [==============================] - 0s 22ms/step - loss: 0.1330 - val_loss: 0.1054
Epoch 3/10
15/15 [==============================] - 0s 22ms/step - loss: 0.0739 - val_loss: 0.0587
Epoch 4/10
15/15 [==============================] - 0s 23ms/step - loss: 0.0442 - val_loss: 0.0378
Epoch 5/10
15/15 [==============================] - 0s 22ms/step - loss: 0.0266 - val_loss: 0.0283
Epoch 6/10
15/15 [==============================] - 0s 23ms/step - loss: 0.0193 - val_loss: 0.0233
Epoch 7/10
15/15 [==============================] - 0s 23ms/step - loss: 0.0165 - val_loss: 0.0212
Epoch 8/10
15/15 [==============================] - 0s 22ms/step - loss: 0.0151 - val_loss: 0.0204
Epoch 9/10
15/15 [==============================] - 0s 22ms/step - loss: 0.0144 - val_loss: 0.0199
Epoch 10/10
15/15 [==============================] - 0s 23ms/step - loss: 0.0140 - val_loss: 0.0194

Let's take a look at our training history:

pd.DataFrame(history.history)
loss val_loss
0 0.638868 0.192506
1 0.133038 0.105378
2 0.073934 0.058686
3 0.044192 0.037832
4 0.026648 0.028253
5 0.019265 0.023272
6 0.016508 0.021151
7 0.015126 0.020368
8 0.014371 0.019929
9 0.014009 0.019429
# convert the training history to a dataframe
history_df = pd.DataFrame(history.history)
# use Pandas native plot method
fig, ax = plt.subplots(figsize=(10,5))
history_df['loss'].plot(ax=ax)
history_df['val_loss'].plot(ax=ax)
ax.set_ylim(0,.05)
(0.0, 0.05)

png

🏋️ Exercise 5.2: Improve loss by varying nodes and hidden layers

Take your former model as a starting point and now either add nodes or layers to see whether the model improves.

model = keras.Sequential([
    ### YOUR CODE HERE ###

])

model.compile(
    optimizer='adam',
    loss='mse',
)

history = model.fit(
    X_train_std, y_train_std,
    validation_data=(X_val_std, y_val_std),
    batch_size=256,
    epochs=10,
)
Epoch 1/10
15/15 [==============================] - 1s 11ms/step - loss: 3.0090 - val_loss: 1.9299
Epoch 2/10
15/15 [==============================] - 0s 4ms/step - loss: 1.3536 - val_loss: 0.7737
Epoch 3/10
15/15 [==============================] - 0s 3ms/step - loss: 0.4739 - val_loss: 0.2300
Epoch 4/10
15/15 [==============================] - 0s 4ms/step - loss: 0.1334 - val_loss: 0.0781
Epoch 5/10
15/15 [==============================] - 0s 4ms/step - loss: 0.0592 - val_loss: 0.0471
Epoch 6/10
15/15 [==============================] - 0s 4ms/step - loss: 0.0402 - val_loss: 0.0354
Epoch 7/10
15/15 [==============================] - 0s 3ms/step - loss: 0.0335 - val_loss: 0.0325
Epoch 8/10
15/15 [==============================] - 0s 4ms/step - loss: 0.0311 - val_loss: 0.0305
Epoch 9/10
15/15 [==============================] - 0s 3ms/step - loss: 0.0291 - val_loss: 0.0288
Epoch 10/10
15/15 [==============================] - 0s 3ms/step - loss: 0.0274 - val_loss: 0.0271
y_pred = model.predict(X_test_std)
plt.plot(np.exp(y_test_std), np.exp(y_pred), ls='', marker='o')
[<matplotlib.lines.Line2D at 0x7f5d26303e10>]

png

🏋️ Exercise 5.3: Learning Curves

Using 4 hidden layers, now create 4 models that run for 30 epochs each:

  1. Vary the number of nodes in each layer
  2. Record the train/val/test score (MSE)
  3. Plot either total nodes or total trainable parameters vs. score for each of the 4 models
# Code cell for exercise 5.3
tot. units test mse val mse train mse
0 4 0.030658 0.030919 0.031088
1 25 0.020217 0.020164 0.019062
2 100 0.017429 0.018110 0.017416
3 289 0.019124 0.019371 0.018190

When we look at our loss history, notice that we sometimes hit a minimum before the last epoch. We'll discuss how to deal with this in the next session!