Architecture

This chapter collects some tricky design choices that appear in various types of neural networks.

Skip Connection

It was first introduced in ResNet as an identity shortcut connection: the input of a block is added directly to the block's output, so the stacked layers only need to fit a residual function.
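
Below is a minimal sketch of such a block in PyTorch (the two-conv structure and channel sizes are illustrative assumptions, not copied from the original paper):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A toy residual block: the input skips over the conv stack
    and is added back to its output, y = F(x) + x."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: add the unchanged input back

x = torch.randn(1, 64, 32, 32)
y = ResidualBlock(64)(x)  # same shape as x
```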

Connections between ResNet, DenseNet and Higher Order RNN

Residual networks essentially belong to the family of densely connected networks except that their connections are shared across steps.

Proof (the following derivation comes from the Dual Path Networks paper by Yunpeng Chen et al.):

We use $h^t$ to denote the hidden state of the recurrent neural network at the $t$-th step and use $k$ as the index of the current step. Let $x^t$ denote the input at the $t$-th step, with $h^0 = x^0$. For each step, $f_t^k(\cdot)$ refers to the feature extracting function, which takes the hidden state as input and outputs the extracted information. $g^k(\cdot)$ denotes a transformation function that transforms the gathered information into the current hidden state:

$$h^k = g^k\Big[\sum_{t=0}^{k-1} f_t^k(h^t)\Big]$$

For HORNNs, weights are shared across steps, i.e. $\forall t,k,\ f_t^k(\cdot) \equiv f_t(\cdot)$ and $\forall k,\ g^k(\cdot) \equiv g(\cdot)$. For densely connected networks, each step (micro-block) has its own parameters, which means $f_t^k(\cdot)$ and $g^k(\cdot)$ are not shared.
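
As a rough illustration of this formulation (a toy numpy sketch; the random linear maps standing in for $f_t^k(\cdot)$ and $g^k(\cdot)$ are assumptions made only for the example), the same unrolling covers both the shared HORNN case and the unshared densely connected case:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, steps = 4, 5

def linear():
    """A toy transformation standing in for f_t^k(.) or g^k(.)."""
    W = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    return lambda v: np.tanh(W @ v)

def unroll(shared: bool):
    # shared=True  -> HORNN-style:        f_t^k = f_t and g^k = g for every k
    # shared=False -> densely connected:  fresh f_t^k and g^k at every step k
    f_t = [linear() for _ in range(steps)]     # reused across k when shared
    g = linear()
    h = [rng.normal(size=dim)]                 # h^0 = x^0
    for k in range(1, steps):
        f = f_t if shared else [linear() for _ in range(k)]
        g_k = g if shared else linear()
        h.append(g_k(sum(f[t](h[t]) for t in range(k))))  # h^k = g^k[ sum_t f_t^k(h^t) ]
    return h

dense_states = unroll(shared=False)   # every micro-block has its own parameters
hornn_states = unroll(shared=True)    # weights shared across steps
```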

Now assume the feature extracting functions are also shared across the output index, i.e. $f_t^k(\cdot) \equiv f_t(\cdot)$, and define $r^k$ as the information gathered up to step $k$:

$$r^k \triangleq \sum_{t=1}^{k-1} f_t(h^t) = r^{k-1} + f_{k-1}(h^{k-1})$$

$$h^k = g^k(r^k)$$

$$\Longrightarrow r^k = r^{k-1} + f_{k-1}(h^{k-1}) = r^{k-1} + f_{k-1}\big(g^{k-1}(r^{k-1})\big) = r^{k-1} + \phi^{k-1}(r^{k-1})$$

Specifically, when $\forall k,\ \phi^k(\cdot) = \phi(\cdot)$, the above equation degenerates to a recurrent neural network; when none of the $\phi^k(\cdot)$ are shared and $x^k = 0,\ k > 1$, it produces a residual network.
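
As a quick numerical check of this degeneration (a toy numpy sketch with illustrative random functions; the inner sum here starts at $t = 0$ as in the first equation above, which does not affect the recursion), the unrolled dense form with shared $f_t$ satisfies exactly the residual recursion $r^k = r^{k-1} + \phi^{k-1}(r^{k-1})$:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, steps = 4, 6

def linear():
    """Toy stand-in for f_t(.) and g^k(.)."""
    W = rng.normal(size=(dim, dim)) / np.sqrt(dim)
    return lambda v: np.tanh(W @ v)

f = [linear() for _ in range(steps)]   # f_t: shared across output steps k
g = [linear() for _ in range(steps)]   # g^k: a fresh transform per step

# Unroll the dense form: r^k = sum_{t=0}^{k-1} f_t(h^t),  h^k = g^k(r^k)
h = [rng.normal(size=dim)]             # h^0 = x^0
r = [None]                             # r^0 is never used
for k in range(1, steps):
    r.append(sum(f[t](h[t]) for t in range(k)))
    h.append(g[k](r[k]))

# The residual recursion derived above: r^k = r^{k-1} + f_{k-1}(g^{k-1}(r^{k-1}))
for k in range(2, steps):
    phi_prev = f[k - 1](g[k - 1](r[k - 1]))       # phi^{k-1}(r^{k-1})
    assert np.allclose(r[k], r[k - 1] + phi_prev)
print("dense unrolling matches r^k = r^{k-1} + phi^{k-1}(r^{k-1})")
```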
