Motivation
The widely used neural machine translation (NMT) model is the sequence-to-sequence (seq2seq) model with an encoder and a decoder; the Transformer model (2017) has set a new state of the art in NMT. Such a model has many parameters to decide and train: hyper-parameters and model parameters. The hyper-parameters are chosen empirically, based on their effect on the development set, while the model parameters are initialized and then trained by SGD. Setting proper initial model parameters is important for finding a well-optimized model more quickly and effectively.
According to 1702.08591, deep neural networks have achieved outstanding performance, and reducing the tendency of gradients to vanish or explode with depth has been essential to this progress. Combining careful initialization with batch normalization bakes two solutions to the vanishing/exploding gradient problem into a single architecture: He initialization ensures variance is preserved across rectifier layers, and batch normalization ensures that backpropagation through layers is unaffected by the scale of the weights.
The shattered gradients problem is that, as depth increases, gradients in standard feedforward networks increasingly resemble white noise. ResNets dramatically reduce the tendency of gradients to shatter.
Methodology
1 Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
According to 1312.6120, “We empirically show that if we choose the initial weights in each layer to be a random orthogonal matrix, instead of a scaled random Gaussian matrix, then this orthogonal random initialization condition yields depth independent learning times just like greedy layerwise pre-training (indeed the red and green curves are indistinguishable).” That is to say, both greedy layer-wise pre-training and random orthogonal initialization are recommended.
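A minimal NumPy sketch of such a random orthogonal initialization for a single fully connected weight matrix, assuming a (fan_out, fan_in) layout; the layer sizes and the `gain` argument are illustrative, not taken from the paper:

```python
import numpy as np

def orthogonal_init(fan_out, fan_in, gain=1.0, rng=None):
    """Random orthogonal weight matrix of shape (fan_out, fan_in) via QR."""
    rng = np.random.default_rng() if rng is None else rng
    # Start from a random Gaussian matrix ...
    a = rng.standard_normal((max(fan_out, fan_in), min(fan_out, fan_in)))
    # ... and keep only the orthonormal factor of its QR decomposition.
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))  # sign correction so the result is uniformly distributed
    w = q if fan_out >= fan_in else q.T
    return gain * w

W = orthogonal_init(512, 256)                         # illustrative layer shape
print(np.allclose(W.T @ W, np.eye(256), atol=1e-6))   # columns are orthonormal
```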
2 Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
According to 1502.01852, deep CNNs were in the past mostly initialized with random weights drawn from Gaussian distributions. Glorot and Bengio then proposed adopting a properly scaled uniform distribution for initialization, known as “Xavier” initialization. In the section “Forward propagation case”, He et al. derive their initialization method: when ReLU/PReLU is used as the activation function, the weights of each layer should be drawn from a zero-mean Gaussian distribution whose standard deviation is \(\sqrt{2/n_l}\), where \(n_l\) is the number of incoming connections (fan-in) of layer \(l\).
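A minimal NumPy sketch of He initialization, assuming a convolution weight tensor laid out as (out_channels, in_channels, k, k) so that the fan-in is the product of the trailing dimensions; the layer shape in the usage line is illustrative:

```python
import numpy as np

def he_init(shape, rng=None):
    """Zero-mean Gaussian weights with std = sqrt(2 / fan_in).

    For a conv layer of shape (out_channels, in_channels, k, k),
    fan_in = in_channels * k * k; for a dense layer of shape
    (fan_out, fan_in), fan_in is the last dimension.
    """
    rng = np.random.default_rng() if rng is None else rng
    fan_in = int(np.prod(shape[1:]))
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=shape)

W = he_init((64, 3, 3, 3))                    # e.g. the first conv layer of a small CNN
print(W.std(), np.sqrt(2.0 / (3 * 3 * 3)))    # empirical vs. target standard deviation
```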
3 ALL YOU NEED IS A GOOD INIT
According to 1511.06422, the paper comes up with a novel method: LSUV (layer-sequential unit-variance) initialization. The LSUV process estimates the output variance of each convolution and inner-product layer and scales the weights to make that variance equal to one. Batch normalization, a technique that inserts layers into the deep net to transform each batch's output to zero mean and unit variance, has successfully facilitated training of the twenty-two-layer GoogLeNet; however, batch normalization adds about a 30% computational overhead to each iteration.
First, fill the weights with Gaussian noise with unit variance. Second, decompose them to an orthonormal basis with a QR or SVD decomposition and replace the weights with one of the components. Third, iteratively rescale the weight matrix of each layer so that the variance of its output is close to one.
The full LSUV algorithm is: pre-initialize the network with orthonormal matrices as above; then, for each layer L, while \(|Var(B_L) - 1| \ge Tol_{var}\) and \(T_i < T_{max}\), do a forward pass with a mini-batch, calculate \(Var(B_L)\), and set \(W_L \leftarrow W_L / \sqrt{Var(B_L)}\). A sketch in code follows the notation list below.
In this layer-sequential unit-variance orthogonal initialization algorithm,
- \(L\) - a convolution or fully connected layer,
- \(W_L\) - its weights,
- \(B_L\) - its output blob,
- \(Tol_{var}\) - variance tolerance,
- \(T_i\) - current trial,
- \(T_{max}\) - max number of trials.
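Below is a minimal NumPy sketch of the LSUV procedure for a small stack of fully connected ReLU layers, using the notation above; the network sizes, the mini-batch, and the omission of biases are simplifying assumptions (the paper states the algorithm for convolution and inner-product layers in general):

```python
import numpy as np

def orthonormal(shape, rng):
    """Orthonormal matrix of the given (possibly rectangular) shape via QR."""
    rows, cols = shape
    q, r = np.linalg.qr(rng.standard_normal((max(rows, cols), min(rows, cols))))
    q *= np.sign(np.diag(r))
    return q if rows >= cols else q.T

def lsuv_init(layer_sizes, x_batch, tol_var=0.1, t_max=10, rng=None):
    """LSUV: orthonormal pre-init, then layer-by-layer rescaling to unit variance."""
    rng = np.random.default_rng() if rng is None else rng
    weights, h = [], x_batch
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W_L = orthonormal((fan_in, fan_out), rng)  # pre-initialize W_L orthonormally
        for t_i in range(t_max):                   # stop after T_max trials
            B_L = h @ W_L                          # forward pass: output blob of layer L
            var = B_L.var()
            if abs(var - 1.0) < tol_var:           # |Var(B_L) - 1| < Tol_var
                break
            W_L /= np.sqrt(var)                    # W_L = W_L / sqrt(Var(B_L))
        weights.append(W_L)
        h = np.maximum(h @ W_L, 0.0)               # ReLU output feeds the next layer
    return weights

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 256))                # illustrative mini-batch
ws = lsuv_init([256, 512, 512, 256], x, rng=rng)
```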
References
- 1312.6120, Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
- 1502.01852, Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
- 1511.06422, ALL YOU NEED IS A GOOD INIT
- 1702.08591, The Shattered Gradients Problem: If resnets are the answer, then what is the question?