The simplest form of a fully recurrent neural network is a multilayer perceptron (MLP) in which the previous set of *hidden unit activations* is fed back into the network along with the inputs. First, recall that a simple MLP has $x$ and $y$ as input and output vectors, with nonlinear activation layers between them, except perhaps on the readout (or output) layer:

<img src="../images/MLP_classic.jpg" alt="Drawing" style="width: 50px;"/>

Formally, for an MLP with $L$ layers, the output of layer $l$ is given by:

\begin{equation}
h^l= f_H\big(W^lh^{l-1} + b^l\big)
\end{equation}
\begin{equation}
\hspace{0.05cm} y = f_O\big(W^Oh^{L} + b^O\big)
\end{equation}


where $h^0=x$ is the input and $h^l$ is the vector of activations of layer $l$, for $l=1,\dots,L$.

For example, consider the single-layer MLP implementation below. Notice that the parameters are separated into four parts: two for the single hidden layer ($W^1,b^1$) and two for the output layer ($W^O,b^O$).

In [None]:
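function mlp1(param, x)
    # Hidden layer: tanh is broadcast elementwise over the affine map
    h = tanh.(x * param[1] .+ param[2])
    # Linear readout layer (no nonlinearity on the output)
    y = h * param[3] .+ param[4]
    return y
end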
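
To make the parameter layout concrete, here is a minimal usage sketch. The sizes, the random initialization, and the name `mlp1_sketch` are assumptions chosen only for illustration; the function repeats the same computation with `tanh` broadcast elementwise and uses the row-vector convention of the code (each row of `x` is one sample).

```julia
using Random

# Same computation as the single-layer MLP above (hypothetical name).
function mlp1_sketch(param, x)
    h = tanh.(x * param[1] .+ param[2])   # hidden activations, size (batch, n_hid)
    y = h * param[3] .+ param[4]          # linear readout, size (batch, n_out)
    return y
end

Random.seed!(0)
n_in, n_hid, n_out, batch = 3, 5, 2, 4    # illustrative sizes (assumed)

param = [
    0.1 .* randn(n_in, n_hid),    # W^1
    zeros(1, n_hid),              # b^1
    0.1 .* randn(n_hid, n_out),   # W^O
    zeros(1, n_out),              # b^O
]

x = randn(batch, n_in)            # one sample per row
y = mlp1_sketch(param, x)
size(y)                           # one row of outputs per sample
```

With this layout, `param[1]` has one row per input feature and one column per hidden unit, so the product `x * param[1]` is defined for any batch size.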

In contrast, a single-layer RNN becomes time-dependent, taking the previous hidden state $h_{t-1}$ as an extra input and returning the next hidden state $h_t$ as an extra output:

<img src="../images/MLP_rnn.jpg" alt="Drawing" style="width: 100px;"/>

To see how this is done, first we make $h^l$ and $y$ time-dependent (recalling that $h^0=x$):

\begin{equation}
h^l_t= f_H\big(W^lh^{l-1}_t + b^l\big)
\end{equation}
\begin{equation}
\hspace{0.05cm} y_t = f_O\big(W^Oh^{L}_t + b^O\big)
\end{equation}

So far nothing has changed: all weight matrices are still time-independent, and we are simply following the propagation of a single sample pair at time $t$. An RNN layer additionally takes the hidden state *from the same layer* at the previous time step as an input, i.e. in addition to $h^{l-1}_t$, layer $l$ also takes $h^l_{t-1}$ as an input:

\begin{equation}
h^l_t= f_H\big(W^l_1h^{l-1}_t + W^l_2h^l_{t-1} + b^l\big)
\end{equation}
\begin{equation}
\hspace{0.05cm} y_t = f_O\big(W^Oh^{L}_t + b^O\big)
\end{equation}

Notice that the output layer is unchanged. Now consider a single-layer RNN. For simplicity, we can combine the weight matrices $W_1$ and $W_2$ into a single matrix $W$ by concatenating the two inputs, and we drop the superscript $l$ for clarity:

\begin{equation}
h_t= f_H\big(W\,[x_t\,\,h_{t-1}] + b\big)
\end{equation}
\begin{equation}
\hspace{0.05cm} y_t = f_O\big(W^Oh_t + b^O\big)
\end{equation}

This simple single-layer RNN is implemented below with $h_{t-1}\equiv h_{t^-}$:

In [6]:
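function mlp1_rnn(param, hₜ₋, xₜ)
    # Concatenate previous hidden state and current input along columns;
    # the rows of param[1] must follow the same (hidden, input) order.
    input  = hcat(hₜ₋, xₜ)
    hₜ = tanh.(input * param[1] .+ param[2])  # elementwise tanh
    yₜ = hₜ * param[3] .+ param[4]            # linear readout
    return (hₜ, yₜ)                           # next hidden state and output
end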

mlp1_rnn (generic function with 1 method)
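
A single step only becomes a recurrent network once it is applied repeatedly, feeding each returned hidden state into the next call. The sketch below illustrates this under stated assumptions: `rnn_step` and `unroll` are hypothetical names, the step returns the pair `(hₜ, yₜ)`, the initial hidden state is all zeros, and the sizes are arbitrary.

```julia
using Random

# One RNN step in the row-vector convention (hypothetical name).
function rnn_step(param, h_prev, x_t)
    input = hcat(h_prev, x_t)                   # (batch, n_hid + n_in)
    h_t = tanh.(input * param[1] .+ param[2])   # (batch, n_hid)
    y_t = h_t * param[3] .+ param[4]            # (batch, n_out)
    return (h_t, y_t)
end

# Hypothetical helper: apply the step along a sequence, carrying h forward.
function unroll(param, h0, xs)
    h = h0
    ys = Matrix{Float64}[]
    for x_t in xs
        h, y_t = rnn_step(param, h, x_t)
        push!(ys, y_t)
    end
    return (h, ys)
end

Random.seed!(0)
n_in, n_hid, n_out, batch, T = 3, 5, 2, 4, 6    # illustrative sizes (assumed)

param = [
    0.1 .* randn(n_hid + n_in, n_hid),   # combined weight matrix W
    zeros(1, n_hid),                     # b
    0.1 .* randn(n_hid, n_out),          # W^O
    zeros(1, n_out),                     # b^O
]

xs = [randn(batch, n_in) for _ in 1:T]   # a sequence of T inputs
h0 = zeros(batch, n_hid)                 # assumed initial hidden state
h_T, ys = unroll(param, h0, xs)
```

The same `param` is reused at every step, which is what it means for the weights to be time-independent.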

<h1><center> Advanced Discussion</center></h1>

### Size of Weight Matrices

We've seen that a single-layer RNN has the following form:

\begin{equation}
\begin{aligned}
h_t &= f_H\big(W_1x_t + W_2h_{t-1} + b\big)\\
y_t &= f_O\big(W^Oh_t + b^O\big)
\end{aligned}
\end{equation}

Let $I=\mathrm{length}(x_t)$ be the size of the input and $H=\mathrm{length}(h_t)$ be the size of the hidden state. Then $W_1$ connects the $I$ inputs to the $H$ hidden units and has $IH$ entries, while $W_2$ connects the hidden state to itself and has $HH$ entries. With $O=\mathrm{length}(y_t)$, we may rewrite these equations, labelling each weight matrix by the sizes it connects, as:

\begin{equation}
\begin{aligned}
h_t &= f_H\big(W_{IH}x_t + W_{HH}h_{t-1} + b_H\big)\\
y_t &= f_O\big(W_{HO}h_t + b_O\big)
\end{aligned}
\end{equation}
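
As a quick check of these shapes, the sketch below builds the two weight matrices in the column-vector convention of the equations (so $W_{IH}$ is stored as an $H \times I$ array) and compares the two-matrix form with the single concatenated matrix used earlier. The sizes and random values are assumptions for illustration only.

```julia
using Random

Random.seed!(0)
n_in, n_hid = 3, 5                 # I and H (illustrative sizes)

W_IH = randn(n_hid, n_in)          # maps the input to the hidden units
W_HH = randn(n_hid, n_hid)         # maps the previous hidden state to the hidden units
b_H  = randn(n_hid)

x_t    = randn(n_in)
h_prev = randn(n_hid)

# Two separate matrices, as in the equations above
separate = tanh.(W_IH * x_t .+ W_HH * h_prev .+ b_H)

# One matrix acting on the concatenated vector [x_t; h_{t-1}]
W = hcat(W_IH, W_HH)               # size (n_hid, n_in + n_hid)
combined = tanh.(W * vcat(x_t, h_prev) .+ b_H)

separate ≈ combined                # the two forms agree up to rounding
```

The column order of `W` has to match the order in which the inputs are concatenated; swapping one without the other silently changes the model.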

### Multi-layer MLP RNN

For a single-layer RNN it was easy to understand the implementation since $h^{l-1}_t\equiv x_t$ with $l=1$. To illustrate how this works for multi-layer RNNs, first note that the output of an activation function $h^l_t$ is both passed forward through the RNN at time $t$ and held back within its own layer to be used at the later time $t+1$:

<img src="../images/MLP_rnn_full.jpg" alt="Drawing" style="width: 200px;"/>

In addition, while so far we have assumed that every RNN layer has an identical form, i.e. they are all represented by the same structure $A$, in general each layer may have a different structure $A_l$. For example, a generic RNN could be created from a combination of the following layers:

In [9]:
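function mlp_rnn_tanh(param, hₜ₋, xₜ)
    input  = hcat(hₜ₋, xₜ)
    hₜ = tanh.(input * param[1] .+ param[2])  # elementwise tanh
    # Next hidden state, and the same activations as the layer's output
    return (hₜ, hₜ)
end

function mlp_rnn_relu(param, hₜ₋, xₜ)
    input  = hcat(hₜ₋, xₜ)
    hₜ = max.(0, input * param[1] .+ param[2])  # elementwise ReLU
    # Next hidden state, and the same activations as the layer's output
    return (hₜ, hₜ)
end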

mlp_rnn_relu (generic function with 1 method)
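
One way such layers might be combined is sketched below: at each time step the input flows upward through the stack, and every layer also receives its own hidden state from the previous step. This is only an illustration under stated assumptions. The names `layer_tanh`, `layer_relu`, `stack_step`, and `run_stack` are hypothetical, the layers are simplified to return only the new hidden state, and the sizes and zero initial states are arbitrary.

```julia
using Random

# Simplified layers that return only the new hidden state (hypothetical names).
layer_tanh(param, h_prev, x_t) = tanh.(hcat(h_prev, x_t) * param[1] .+ param[2])
layer_relu(param, h_prev, x_t) = max.(0, hcat(h_prev, x_t) * param[1] .+ param[2])

# One time step through the stack: layer l sees its own previous state
# h_prevs[l] and the current output of layer l-1 (or x_t for the first layer).
function stack_step(layers, params, h_prevs, x_t)
    input = x_t
    h_new = similar(h_prevs)
    for l in eachindex(layers)
        h_new[l] = layers[l](params[l], h_prevs[l], input)
        input = h_new[l]          # passed forward to the next layer
    end
    return h_new
end

# Apply the stack along a sequence, carrying every layer's state forward in time.
function run_stack(layers, params, hs, xs)
    for x_t in xs
        hs = stack_step(layers, params, hs, x_t)
    end
    return hs
end

Random.seed!(0)
n_in, n_h1, n_h2, batch, T = 3, 5, 4, 2, 6      # illustrative sizes (assumed)

layers = [layer_tanh, layer_relu]
params = [
    [0.1 .* randn(n_h1 + n_in, n_h1), zeros(1, n_h1)],   # layer 1: W, b
    [0.1 .* randn(n_h2 + n_h1, n_h2), zeros(1, n_h2)],   # layer 2: W, b
]

hs0 = [zeros(batch, n_h1), zeros(batch, n_h2)]  # assumed initial states
xs  = [randn(batch, n_in) for _ in 1:T]
hs_final = run_stack(layers, params, hs0, xs)
```

Note that the weight matrix of layer 2 has `n_h2 + n_h1` rows, because its input is the hidden state of layer 1 rather than the raw input.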