Rectified Linear Unit For Artificial Neural Networks - Part 1 Regression


Our brains house a huge network of nearly 100 billion tiny neural cells (aka neurons) connected by axons.

Neural Networks: Neurons communicate by sending electric charges to each other. A neuron only fires an electric charge if it is sufficiently stimulated, in which case the neuron is said to be activated. Through an incredibly intricate scheme of communication, each pattern of electric charges fired throughout the brain is translated into our neural activities, whether it is tasting a burger, telling a joke, or enjoying scenery.

Learning: To activate a neuron, sufficient electric charge has to travel through the axon of that neuron. Some axons are more conductive of electricity than others. If there is too much conductivity in a brain, the person could have a seizure and possibly die. However, brains are designed to minimize energy consumption. Learning happens in our brains by making the neurons responsible for a certain act or thought more conductive and more connected. So every time we play the violin, for example, the part of our brain that plays the violin gets more and more connected and conductive. This in turn makes the electric charges in this area travel much faster, which translates into faster responses. In other words, playing the violin becomes "second nature". As the proverb goes, "practice makes perfect".

Artificial Neural Networks (ANN): This idea is simulated in artificial neural networks, where we represent our model as neurons connected by edges (similar to axons). The value of a neuron is simply the sum of the values of the previous neurons connected to it, weighted by the weights of their edges. Finally, that value is passed through a function that decides how much the neuron should be activated; this is called an activation function.
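To make the weighted sum concrete, here is a minimal sketch of a single artificial neuron in NumPy. The names `neuron_output` and `step`, the threshold value, and the numbers are all made up for illustration; real networks use smoother activation functions than this hypothetical step function.

```python
import numpy as np


def neuron_output(inputs, weights, activation):
    """Weighted sum of the incoming neuron values, passed through an activation."""
    weighted_sum = float(np.dot(inputs, weights))
    return activation(weighted_sum)


def step(z, threshold=1.0):
    # hypothetical activation: fire (1.0) only if the neuron is stimulated enough
    return 1.0 if z >= threshold else 0.0


previous = np.array([0.5, 1.0, 0.25])  # values of the previous neurons
edges = np.array([0.8, 0.4, -1.0])     # weights of the connecting edges

# weighted sum is 0.5*0.8 + 1.0*0.4 + 0.25*(-1.0) = 0.55, below the threshold
fired = neuron_output(previous, edges, step)
```

With these numbers the neuron stays silent, because the weighted sum does not reach the threshold.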

ANN and Linear Algebra: ANNs are just a fancy representation of matrix multiplication. Each layer in an ANN is simply a vector, while the weights connecting layers are matrices. Formally, we refer to them as tensors, as they can vary in their dimensionality. For example, consider the following input:

We have 3 layers, input, hidden, and output. The input layer is simply the 16-dimensional feature vector of the input image. The hidden layer is a 4-dimensional vector of neurons that represent a more abstracted version of the raw input features. We obtain this hidden layer by simply multiplying the input vector with the weights matrix $W_1$, which is 16x4. Similarly, the output layer is obtained by multiplying the hidden layer by another weights matrix $W_2$, which is 4x2.
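The shapes described above can be checked with a few lines of NumPy. This is only a sketch: the input and weights are random placeholders rather than a trained network, and biases and activation functions are left out to keep the focus on the matrix multiplications.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(16)              # 16-dimensional input feature vector
W1 = rng.normal(size=(16, 4))   # weights between the input and hidden layers
W2 = rng.normal(size=(4, 2))    # weights between the hidden and output layers

hidden = x @ W1                 # hidden layer, shape (4,)
output = hidden @ W2            # output layer, shape (2,)
```

Adding another hidden layer just means adding another matrix whose row count matches the size of the layer before it.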

Deep Neural Networks: These ANNs can get really deep by simply adding as many hidden layers as we want, making them Deep Neural Networks (DNN).

Training a neural network: To simplify things to an unfair degree, we basically start with random values for the weights. We travel through the layers to the output layer, which houses our predictions. We calculate the error of our predictions and adjust our weight matrices slightly to reduce it. We repeat until the weights stop changing much. This does not do justice to the neatness of the gradient descent and backpropagation algorithms, but it is enough for using neural networks in applications. Here is a GIF of an error (aka loss) getting smaller and smaller as the weights are modified.
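The loop described above can be sketched for the simplest possible "network": a single weight vector with no hidden layers. The toy data, the learning rate, and the number of steps are assumptions picked for illustration; a real network would use backpropagation to get the gradient for every layer.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.random((100, 3))                  # toy inputs, made up for illustration
true_w = np.array([2.0, -1.0, 0.5])       # weights we hope to recover
y = X @ true_w                            # toy targets

w = rng.normal(size=3)                    # start with random weights
learning_rate = 0.1
losses = []

for _ in range(500):
    y_hat = X @ w                         # travel forward to get predictions
    error = y_hat - y                     # how wrong the predictions are
    losses.append(float(np.mean(error ** 2)))
    gradient = 2 * X.T @ error / len(y)   # direction in which the loss grows
    w -= learning_rate * gradient         # slightly fix the weights
```

Plotting `losses` should give the same kind of shrinking curve the GIF shows.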

ReLU in Regression

Activation Function (ReLU)

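We apply activation functions to hidden and output neurons to keep their values from going too low or too high, which would work against the learning process of the network. They also add the non-linearity that lets stacked layers model more than a single matrix multiplication could. Simply put, the math works better this way.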

The most important activation function is the one applied to the output layer. If the NN is applied to a regression problem, then the output should be continuous. For the sake of demonstration, we are using the Boston house-prices dataset. A house price cannot be negative. We enforce this rule by using one of the most intuitive and useful activation functions: the Rectified Linear Unit (ReLU). The only thing it does is this: if the value is negative, set it to zero. Yup, that's it.
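Written out, ReLU is $f(z)=\max(0, z)$. Here is a minimal NumPy sketch; the function name `relu` and the sample values are ours, not part of any library used in this post.

```python
import numpy as np


def relu(z):
    """Rectified Linear Unit: negative values become zero, the rest pass through."""
    return np.maximum(z, 0.0)


raw = np.array([-3.2, -0.5, 0.0, 1.7, 24.0])  # made-up raw output values
clipped = relu(raw)
```

Applied to the output neuron, this guarantees that no predicted price is below zero.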

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
# note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.models import Model

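# ensuring that our random generators are fixed so the results remain reproducible
# (the seeding lines are missing from the source; the seed value 42 is an assumption)
np.random.seed(42)
tf.random.set_seed(42)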
In [2]:
data = load_boston()
X = data["data"]
y = data["target"]
df = pd.DataFrame(X, columns=data["feature_names"])
df["PRICE"] = y
        CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  PRICE
0    0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98   24.0
1    0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14   21.6
2    0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03   34.7
3    0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94   33.4
4    0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33   36.2
...      ...   ...    ...   ...    ...    ...   ...     ...  ...    ...      ...     ...    ...    ...
501  0.06263   0.0  11.93   0.0  0.573  6.593  69.1  2.4786  1.0  273.0     21.0  391.99   9.67   22.4
502  0.04527   0.0  11.93   0.0  0.573  6.120  76.7  2.2875  1.0  273.0     21.0  396.90   9.08   20.6
503  0.06076   0.0  11.93   0.0  0.573  6.976  91.0  2.1675  1.0  273.0     21.0  396.90   5.64   23.9
504  0.10959   0.0  11.93   0.0  0.573  6.794  89.3  2.3889  1.0  273.0     21.0  393.45   6.48   22.0
505  0.04741   0.0  11.93   0.0  0.573  6.030  80.8  2.5050  1.0  273.0     21.0  396.90   7.88   11.9

506 rows × 14 columns

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ReLU Activation Function in Python

In [4]:
input_shape = X.shape[1] # number of features, which is 13
# this is regression
# so we only need one neuron to represent the prediction
output_shape = 1
In [5]:
# we set up our input layer
inputs = Input(shape=(input_shape,))
# we add 3 hidden layers with diminishing size, a common practice in designing a neural network:
# as the features get more and more abstracted, we need fewer and fewer neurons.
h = Dense(16, activation="relu")(inputs)
h = Dense(8, activation="relu")(h)
h = Dense(4, activation="relu")(h)
# and finally we use the ReLU activation function on the output layer
out = Dense(output_shape, activation="relu")(h)
model = Model(inputs=inputs, outputs=[out])
model.summary()  # prints the layer table shown below
Model: "functional_1"
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 13)]              0         
dense (Dense)                (None, 16)                224       
dense_1 (Dense)              (None, 8)                 136       
dense_2 (Dense)              (None, 4)                 36        
dense_3 (Dense)              (None, 1)                 5         
Total params: 401
Trainable params: 401
Non-trainable params: 0
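The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# 13 input features, then the 16, 8, 4 hidden layers and the single output neuron
layer_sizes = [13, 16, 8, 4, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)  # [224, 136, 36, 5] 401
```

These match the `Param #` column and the total of 401 reported by Keras.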

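We use the mean squared error (MSE) as the error we are trying to minimize: $$MSE=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$$ where $y_i$ is the true price, $\hat{y}_i$ is the predicted price, and $n$ is the number of samples.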
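As a quick sanity check of the formula, here is a NumPy sketch. The function name `mse` is ours, and the three predictions are made up; only the three true prices are taken from the first rows of the dataset.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# errors are 2, 0 and 4, so the MSE is (4 + 0 + 16) / 3
example = mse([24.0, 21.6, 34.7], [22.0, 21.6, 30.7])
```

Because the errors are squared, the one prediction that is off by 4 contributes four times as much as the one that is off by 2.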

Adam is an extension of gradient descent used for optimization that adapts the step size for each weight. It often converges faster than plain gradient descent. The details are for another day.

In [6]:
model.compile(optimizer="adam", loss="mean_squared_error")

We fit our model for 4 epochs, where each epoch is a full pass over the entire training data. Epochs are different from learning iterations, since an iteration can run on a single batch of the data. An epoch passes every time the model has iterated over all the training data.
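The difference between epochs and iterations is easy to see with a bit of arithmetic. This sketch assumes 379 training rows (what the default 75/25 split of 506 rows produces) and a batch size of 32 (the Keras default when none is given); the helper name is hypothetical.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every training sample once."""
    return math.ceil(n_samples / batch_size)


n_train = 379      # assumed size of the training split
batch_size = 32    # assumed batch size (Keras default)
epochs = 4

per_epoch = iterations_per_epoch(n_train, batch_size)
total_iterations = epochs * per_epoch

print(per_epoch, total_iterations)  # 12 48
```

So under these assumptions the weights are updated 12 times in every epoch, with the last batch being smaller than the rest.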

In [ ]:
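# truncated in the source; epochs=4 follows the text, other arguments are assumed defaults
H = model.fit(X_train, y_train, epochs=4, validation_data=(X_test, y_test))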
In [8]:
fig = plt.figure(figsize=(16, 9))
plt.plot(H.history["loss"], label="loss")
plt.plot(H.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("MSE")
plt.legend()  # show the labels set above

We notice that both the training and testing errors plummet quickly in the first few epochs and converge soon after that. Let's explore the data distribution to better understand how good the performance is.

In [9]:
import seaborn as sns

sns.displot(x=y, kde=True, aspect=16/9)

# Add labels
plt.title(f'Histogram of house prices\nMean: {round(np.mean(y), 2)}\nStandard Deviation: {round(np.std(y), 2)}', fontsize=22)
plt.xlabel('House Price Range', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.xticks(np.arange(0, 50, 2))
In [10]:
y_pred = model.predict(X_test)
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred))}")
print(f"MAE: {mean_absolute_error(y_test, y_pred)}")
print(f"R2: {r2_score(y_test, y_pred)}")
RMSE: 7.416857545316182
MAE: 5.717547614931121
R2: 0.2144506690278849

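The prices look roughly normally distributed, and the RMSE is less than one standard deviation of the target, which agrees with the positive R2: the model does better than always predicting the mean price. An R2 of about 0.21, however, means it explains only about a fifth of the variance in the test prices, so this is a modest start rather than a strong result.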
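The link between RMSE, the standard deviation, and R2 can be made explicit: when both are measured on the same data, $R^2 = 1 - (RMSE/\sigma)^2$, where $\sigma$ is the population standard deviation of the true values. Here is a small NumPy sketch with made-up prices and predictions; the function names are ours.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root of the mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Share of the variance in y_true that the predictions explain."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


y_true = np.array([10.0, 20.0, 30.0, 40.0])  # made-up prices
y_pred = np.array([14.0, 18.0, 33.0, 35.0])  # made-up predictions

score = r2(y_true, y_pred)
ratio = rmse(y_true, y_pred) / float(np.std(y_true))  # np.std is the population std
```

An RMSE below one standard deviation therefore only tells us that R2 is positive; how far below it is tells us how much of the variance is explained.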