An RNN (recurrent neural network) is a type of neural network designed for sequential data processing, where the output depends on the sequence of inputs provided. RNN models can maintain an internal state, or memory, of the sequence processed, which makes them capable of performing tasks where the order of the inputs or the context of the information is important. RNNs are mostly used in time series prediction, natural language processing (NLP), etc. For example, if you type the phrase "The cats are…", the next possible word could be cute, lovely, etc.; for "Cats sometimes…", the next word could be hunt, fight, etc. These predictions are made based on the context in which the word cat is used, and this is what an RNN does. An RNN model is trained in such a way that it learns the contextual relationships between words from a significant amount of training data (by maintaining an internal state of the sequences). This is just a simple example of RNNs in NLP; other, more complex use cases are possible.

Above is a simplified architecture of an RNN, where X represents the input at each time step (0, 1, 2), y is the output, and h is the hidden state. In an RNN, each cell takes the input and the hidden state from the previous cell in order to produce the output. Let's understand its layering architecture.

Input Layer

This is the layer which receives the input sequence at each timestep. The input to an RNN is typically a vector representing the features of the input data at a specific timestep. The input layer processes this vector and passes the information to the next layer, the recurrent layer, which maintains a hidden state capturing the context from previous timesteps. The input at each time step (xt) is fed into the network, and the RNN updates its hidden state based on this input and the hidden state from the previous time step. Mathematically, the update equation for the hidden state (ht) in a basic RNN can be expressed as:

ht = tanh(Wi * xt + Wh * ht-1 + b)

Where ht is the hidden state at timestep t, tanh is the non-linear activation function typically used in a basic RNN, Wi is the input weight matrix, xt is the input at timestep t, Wh is the hidden-state weight matrix, ht-1 is the hidden state at timestep t-1 (the previous step), and b is the bias term.
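As a minimal sketch, the update equation above can be written directly with NumPy. The dimensions and the random weights below are illustrative assumptions, not values from the text:

```python
import numpy as np

# Illustrative sizes (assumptions for the example)
input_size, hidden_size = 4, 3

rng = np.random.default_rng(0)
W_i = rng.standard_normal((hidden_size, input_size))   # input weight matrix (Wi)
W_h = rng.standard_normal((hidden_size, hidden_size))  # hidden-state weight matrix (Wh)
b = np.zeros(hidden_size)                              # bias term (b)

x_t = rng.standard_normal(input_size)   # input at timestep t (xt)
h_prev = np.zeros(hidden_size)          # hidden state at timestep t-1 (ht-1)

# ht = tanh(Wi * xt + Wh * ht-1 + b)
h_t = np.tanh(W_i @ x_t + W_h @ h_prev + b)
print(h_t.shape)  # (3,)
```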

Recurrent Layer

This is also known as the hidden layer of the RNN; it processes the input sequence along with the hidden state from the previous timestep, capturing the contextual information of the input. In a basic RNN, the recurrent layer consists of recurrent units or cells. These cells maintain a hidden state that evolves over time, allowing the network to retain information about past inputs. The hidden state at a given time step influences the processing of the input at the next time step, creating a form of memory in the network. The hidden state is calculated in each cell using the equation above. These layers work up to a point, but they suffer from issues like exploding and vanishing gradients when capturing long-term dependencies. To overcome such issues, advanced recurrent layers were developed, such as LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit).
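To show how the hidden state is carried from one cell to the next, here is a hedged sketch of a recurrent layer unrolled over a short sequence; the sequence length and sizes are assumptions chosen only for illustration:

```python
import numpy as np

# Assumed, illustrative dimensions
seq_len, input_size, hidden_size = 5, 4, 3

rng = np.random.default_rng(1)
W_i = rng.standard_normal((hidden_size, input_size))
W_h = rng.standard_normal((hidden_size, hidden_size))
b = np.zeros(hidden_size)

inputs = rng.standard_normal((seq_len, input_size))  # x0 ... x4
h_t = np.zeros(hidden_size)                          # initial hidden state

hidden_states = []
for x_t in inputs:
    # Each cell reuses the same weights and the hidden state from the previous step
    h_t = np.tanh(W_i @ x_t + W_h @ h_t + b)
    hidden_states.append(h_t)

print(len(hidden_states), hidden_states[-1].shape)  # 5 (3,)
```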

Long Short-Term Memory (LSTM)

These are an advanced version of the simple recurrent layer, developed to overcome some of the limitations of traditional RNNs described above. LSTM cells are capable of storing information over longer sequences effectively. LSTM uses a gating mechanism to control the flow of information, which mainly includes an input gate, a forget gate, and an output gate. As the names signify, the input gate decides what to update, the forget gate decides what to forget, and the output gate decides what to output.
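As a sketch of how an LSTM layer might be used in practice (assuming PyTorch; the batch size, sequence length, and layer sizes below are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions)
batch_size, seq_len, input_size, hidden_size = 2, 5, 4, 3

lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, batch_first=True)

x = torch.randn(batch_size, seq_len, input_size)  # input sequence
output, (h_n, c_n) = lstm(x)                      # per-step hidden states, plus final (h, c)

print(output.shape)  # torch.Size([2, 5, 3]) -> hidden state at each timestep
print(h_n.shape)     # torch.Size([1, 2, 3]) -> final hidden state
print(c_n.shape)     # torch.Size([1, 2, 3]) -> final cell state (the LSTM's longer-term memory)
```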

Gated Recurrent Units (GRU)

GRUs also use a gating mechanism to keep and forget information like LSTMs, but they are relatively simpler, using only a reset gate and an update gate to control the flow of information.
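A GRU layer can be dropped into the same sketch (again assuming PyTorch and illustrative sizes); unlike the LSTM, it returns no separate cell state:

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=4, hidden_size=3, batch_first=True)

x = torch.randn(2, 5, 4)   # (batch, seq_len, input_size), illustrative
output, h_n = gru(x)       # only a hidden state, no separate cell state

print(output.shape)  # torch.Size([2, 5, 3])
print(h_n.shape)     # torch.Size([1, 2, 3])
```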

Output Layer

This is the final layer of the RNN, which is a fully connected layer. It produces the output based on the information captured by the recurrent (hidden) layers throughout the sequential processing of the input data. As in other models, an activation function can be applied to the output layer to produce the output. For example, for regression (such as predicting a future stock price based on past attributes), a linear function can be used; for binary classification (0 or 1, e.g. whether a sentiment is positive or negative), sigmoid can be used; and softmax can be used for multi-class classification.
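For illustration, here is a hedged sketch of a fully connected output layer placed on top of the final hidden state of a recurrent layer, using softmax for multi-class classification (assuming PyTorch; the number of classes and layer sizes are assumptions):

```python
import torch
import torch.nn as nn

hidden_size, num_classes = 3, 4   # illustrative assumptions

rnn = nn.RNN(input_size=4, hidden_size=hidden_size, batch_first=True)
output_layer = nn.Linear(hidden_size, num_classes)   # fully connected output layer

x = torch.randn(2, 5, 4)          # (batch, seq_len, input_size)
_, h_n = rnn(x)                   # final hidden state: (1, batch, hidden_size)

logits = output_layer(h_n.squeeze(0))    # (batch, num_classes)
probs = torch.softmax(logits, dim=-1)    # softmax for multi-class classification
print(probs.shape)  # torch.Size([2, 4])
```

For regression, the softmax would simply be dropped (a linear output), and for binary classification a single-unit output with sigmoid would be used instead.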

Embedding Layer

This is a type of hidden layer which is mostly used in NLP tasks where a sequence of words is processed, such as text classification, question answering, etc. The purpose of the embedding layer is to convert input values, such as indices of words in a vocabulary, into continuous-valued vectors. This layer helps the network learn more about the relationships within the inputs and capture contextual information. Detailed information about word embeddings is in the Natural Language Processing (NLP) section.
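A minimal sketch of an embedding layer feeding a recurrent layer (assuming PyTorch; the vocabulary size, embedding dimension, and word indices below are illustrative assumptions):

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim, hidden_size = 100, 8, 16   # illustrative assumptions

embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
rnn = nn.RNN(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True)

# A batch of two "sentences", each a sequence of word indices into the vocabulary
word_indices = torch.tensor([[1, 5, 20, 3], [7, 2, 99, 0]])

embedded = embedding(word_indices)   # (2, 4, 8): indices -> continuous-valued vectors
output, h_n = rnn(embedded)          # the recurrent layer then processes these vectors
print(embedded.shape, output.shape)  # torch.Size([2, 4, 8]) torch.Size([2, 4, 16])
```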