GRU

Gated Recurrent Unit

The GRU is a variation of the LSTM that addresses the vanishing gradient problem; it can be seen as an improved version of the plain RNN. What makes it special is its two gates, the Update gate and the Reset gate. These are two vectors that decide what information should be passed on to the output. They can be trained to keep information from long ago without washing it out through time, and to remove information that is irrelevant to the prediction.
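Before looking at the gates one by one, here is a minimal usage sketch assuming PyTorch is available; the layer sizes and the dummy batch are arbitrary illustrations.

```python
import torch
import torch.nn as nn

# A GRU layer that maps 10-dimensional inputs to a 20-dimensional hidden state.
gru = nn.GRU(input_size=10, hidden_size=20, batch_first=True)

# Dummy batch: 4 sequences, each 5 time steps long, 10 features per step.
x = torch.randn(4, 5, 10)

# output holds the hidden state at every time step, h_n only the last one.
output, h_n = gru(x)
print(output.shape)  # torch.Size([4, 5, 20])
print(h_n.shape)     # torch.Size([1, 4, 20])
```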

Update Gate

A single GRU cell has two inputs and one output: x(t) is the new information coming in and h(t-1) is the previous information. The update gate decides which information should be updated and which should be carried forward unchanged. That is really powerful, because the model can decide to copy all the information from the past, which eliminates much of the risk of the vanishing gradient problem.
A sigmoid operation is applied to the combination of the current input and the previous information. In this sigmoid operation, the network decides what is relevant to carry forward from the previous output and the new input, assigning values between 0 and 1, where 0 marks information we want to forget and 1 marks information we want to keep. The result is then multiplied with the previous output before it continues to the next step.
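As a rough sketch of the update-gate computation just described (the weight names W_z, U_z and the tiny dimensions are illustrative assumptions, following the standard GRU formulation):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

hidden, inputs = 4, 3
rng = np.random.default_rng(0)

W_z = rng.standard_normal((hidden, inputs))   # weights for the new input x(t)
U_z = rng.standard_normal((hidden, hidden))   # weights for the previous state h(t-1)

x_t = rng.standard_normal(inputs)     # new information at time t
h_prev = rng.standard_normal(hidden)  # previous information h(t-1)

# Update gate: one value per hidden unit, between 0 (forget) and 1 (keep).
z_t = sigmoid(W_z @ x_t + U_z @ h_prev)
print(z_t)
```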

Reset Gate

Essentially, this gate is used by the model to decide how much of the past information to forget. As before, the previous output and the current input are combined and passed through a sigmoid, so the network once again decides what is relevant. The first branch of the sigmoid result is turned into its complement (1 minus the gate value) and multiplied with the clean output coming directly from the previous cell. The second branch is multiplied with the result of the tanh activation applied to the output of the first gating step.
This last part carries the importance of the new information that enters this particular cell. The idea of splitting the result into two paths, one weighted by the gate value and the other by its complement, is to reinforce the separation of concerns between forgetting most of the previous output and keeping it: while one path is close to 0, saying to forget what comes through it, the other is close to 1, saying to save what comes from its side.
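Putting the reset gate, the tanh candidate, and the two complementary paths together, a minimal NumPy sketch of one GRU step might look like this (again assuming the standard formulation; all names and sizes are illustrative, and the setup repeats the update-gate sketch so the block runs on its own):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

hidden, inputs = 4, 3
rng = np.random.default_rng(0)

# Illustrative weights for the reset gate (r), candidate state, and update gate (z).
W_r, U_r = rng.standard_normal((hidden, inputs)), rng.standard_normal((hidden, hidden))
W_h, U_h = rng.standard_normal((hidden, inputs)), rng.standard_normal((hidden, hidden))
W_z, U_z = rng.standard_normal((hidden, inputs)), rng.standard_normal((hidden, hidden))

x_t = rng.standard_normal(inputs)     # new information at time t
h_prev = rng.standard_normal(hidden)  # previous information h(t-1)

# Reset gate: how much of the past feeds the new candidate.
r_t = sigmoid(W_r @ x_t + U_r @ h_prev)

# Candidate state: tanh of the new input plus the reset-scaled past.
h_tilde = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev))

# Update gate and the two complementary paths: for each unit, one path is
# weighted by z_t and the other by (1 - z_t), so when one is near 0 the
# other is near 1 -- keep the old information or take the new candidate.
z_t = sigmoid(W_z @ x_t + U_z @ h_prev)
h_t = (1.0 - z_t) * h_prev + z_t * h_tilde
print(h_t)
```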

Overall, the GRU is similar to the LSTM, but it is a simpler variation with fewer gates and no separate cell state.

