Normalization (machine learning)

In machine learning, normalization is a statistical technique with various applications. There are two main forms of normalization: data normalization and activation normalization. Data normalization, or feature scaling, is a general technique in statistics, and it includes methods that rescale input data so that it has a well-behaved range, mean, variance, and other statistical properties. Activation normalization is specific to deep learning, and it includes methods that rescale the activations of hidden neurons inside a neural network.

Normalization is often used for faster training convergence, less sensitivity to variations in input data, less overfitting, and better generalization to unseen data. Normalization methods are often theoretically justified as reducing covariate shift, smoothing optimization landscapes, and increasing regularization, though they are mainly justified by empirical success.^[1]

Batch normalization

Batch normalization (BatchNorm)^[2] operates on the activations of a layer for each mini-batch.

Consider a simple feedforward network, defined by chaining together modules: $x^{(0)}\mapsto x^{(1)}\mapsto x^{(2)}\mapsto \cdots$ where each network module can be a linear transform, a nonlinear activation function, a convolution, etc. $x^{(0)}$ is the input vector, $x^{(1)}$ is the output vector from the first module, etc.

BatchNorm is a module that can be inserted at any point in the feedforward network. For example, suppose it is inserted just after $x^{(l)}$ , then the network would operate accordingly: $\cdots \mapsto x^{(l)}\mapsto \mathrm {BN} (x^{(l)})\mapsto x^{(l+1)}\mapsto \cdots$ The BatchNorm module does not operate over individual inputs. Instead, it must operate over one batch of inputs at a time.

Concretely, suppose we have a batch of inputs $x_{(1)}^{(0)},x_{(2)}^{(0)},\dots ,x_{(B)}^{(0)}$ , fed all at once into the network. We would obtain in the middle of the network some vectors $x_{(1)}^{(l)},x_{(2)}^{(l)},\dots ,x_{(B)}^{(l)}$ The BatchNorm module computes the coordinate-wise mean and variance of these vectors: ${\begin{aligned}\mu _{i}^{(l)}&={\frac {1}{B}}\sum _{b=1}^{B}x_{(b),i}^{(l)}\\(\sigma _{i}^{(l)})^{2}&={\frac {1}{B}}\sum _{b=1}^{B}(x_{(b),i}^{(l)}-\mu _{i}^{(l)})^{2}\end{aligned}}$ where $i$ indexes the coordinates of the vectors, and $b$ indexes the elements of the batch. In other words, we are considering the $i$ -th coordinate of each vector in the batch, and computing the mean and variance of this collection of numbers.

It then normalizes each coordinate to have zero mean and unit variance: ${\hat {x}}_{(b),i}^{(l)}={\frac {x_{(b),i}^{(l)}-\mu _{i}^{(l)}}{\sqrt {(\sigma _{i}^{(l)})^{2}+\epsilon }}}$ The $\epsilon$ is a small positive constant such as $10^{-8}$ added to the variance for numerical stability, to avoid division by zero.

Finally, it applies a linear transform: $y_{(b),i}^{(l)}=\gamma _{i}{\hat {x}}_{(b),i}^{(l)}+\beta _{i}$ Here, $\gamma$ and $\beta$ are parameters inside the BatchNorm module. They are learnable parameters, typically trained by gradient descent.

The following code illustrates BatchNorm.

import numpy as np

def batchnorm(x, gamma, beta, epsilon=1e-8):
    # x has shape (B, N): a batch of B vectors with N features each.
    # gamma and beta have shape (N,).

    # Mean and variance of each feature, taken over the batch axis
    mu = np.mean(x, axis=0)  # shape (N,)
    sigma2 = np.var(x, axis=0)  # shape (N,)

    # Normalize the activations
    x_hat = (x - mu) / np.sqrt(sigma2 + epsilon)  # shape (B, N)

    # Apply the learnable linear transform
    y = gamma * x_hat + beta  # shape (B, N)

    return y

Interpretation

$\gamma$ and $\beta$ allow the network to learn to undo the normalization if that is beneficial.^[3] Because a neural network can always be followed by a linear transform layer, BatchNorm can be interpreted as removing the purely linear transformations, so that its layers focus purely on modelling the nonlinear aspects of data.^[4]^[3]

The original publication claims that BatchNorm works by reducing "internal covariate shift", though the claim has both supporters^[5]^[6] and detractors.^[7]^[8]

Special cases

The original paper^[2] recommended using BatchNorm only after a linear transform, not after a nonlinear activation. That is, something like $\mathrm {BN} (Wx+b)$ , not $\mathrm {BN} (\phi (Wx+b))$ . Also, the bias $b$ does not matter, since it is canceled by the subsequent mean subtraction, so the module is of the form $\mathrm {BN} (Wx)$ . That is, if a BatchNorm is preceded by a linear transform, then that linear transform's bias term is set to constant zero.^[2]
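The cancellation of the bias can be checked numerically. The following is a minimal sketch, assuming a random weight matrix `W`, a random bias `b`, and a local helper `bn` that performs only the normalization step (without $\gamma$ and $\beta$); these names and shapes are illustrative and are not taken from the original paper.

```python
import numpy as np

def bn(z, epsilon=1e-8):
    # Normalization step only: zero mean and unit variance per feature,
    # computed over the batch axis (axis 0).
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + epsilon)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 5))  # batch of B=32 inputs with 5 features (assumed sizes)
W = rng.normal(size=(5, 3))   # hypothetical weight matrix
b = rng.normal(size=(3,))     # hypothetical bias vector

with_bias = bn(x @ W + b)
without_bias = bn(x @ W)

# Adding b shifts every feature's batch mean by the same amount,
# so the mean subtraction removes it and the two results agree.
bias_cancels = np.allclose(with_bias, without_bias)
```

Since the variance is unaffected by a constant shift, the two outputs agree up to floating-point error, which is why the bias can be dropped.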

For convolutional neural networks (CNN), BatchNorm must preserve the translation invariance of CNN, which means that it must treat all outputs of the same kernel as if they are different data points within a batch.^[2]

Concretely, suppose we have a 2-dimensional convolutional layer defined by $x_{h,w,c}^{(l)}=\sum _{h',w',c'}K_{h'-h,w'-w,c,c'}^{(l)}x_{h',w',c'}^{(l-1)}+b_{c}^{(l)}$ where

$x_{h,w,c}^{(l)}$ is the activation of the neuron at position $(h,w)$ in the $c$ -th channel of the $l$ -th layer.
$K_{\Delta h,\Delta w,c,c'}^{(l)}$ is a kernel tensor. Each channel $c$ corresponds to a kernel $K_{h'-h,w'-w,c,c'}^{(l)}$ , with indices $\Delta h,\Delta w,c'$ .
$b_{c}^{(l)}$ is the bias term for the $c$ -th channel of the $l$ -th layer.

In order to preserve the translational invariance, BatchNorm treats all outputs from the same kernel in the same batch as additional data points in the batch.

That is, it is applied once per kernel $c$ (equivalently, once per channel $c$ ), not per activation $x_{h,w,c}^{(l)}$ : ${\begin{aligned}\mu _{c}^{(l)}&={\frac {1}{BHW}}\sum _{b=1}^{B}\sum _{h=1}^{H}\sum _{w=1}^{W}x_{(b),h,w,c}^{(l)}\\(\sigma _{c}^{(l)})^{2}&={\frac {1}{BHW}}\sum _{b=1}^{B}\sum _{h=1}^{H}\sum _{w=1}^{W}(x_{(b),h,w,c}^{(l)}-\mu _{c}^{(l)})^{2}\end{aligned}}$ where $B$ is the batch size, $H$ is the height of the feature map, and $W$ is the width of the feature map.

That is, even though there are only $B$ data points in a batch, all $BHW$ outputs from the kernel in this batch are treated equally.^[2]

Subsequently, normalization and the linear transform are also done per kernel: ${\begin{aligned}{\hat {x}}_{(b),h,w,c}^{(l)}&={\frac {x_{(b),h,w,c}^{(l)}-\mu _{c}^{(l)}}{\sqrt {(\sigma _{c}^{(l)})^{2}+\epsilon }}}\\y_{(b),h,w,c}^{(l)}&=\gamma _{c}{\hat {x}}_{(b),h,w,c}^{(l)}+\beta _{c}\end{aligned}}$ Similar considerations apply for BatchNorm for n-dimensional convolutions.

The following code illustrates BatchNorm for 2D convolutions:

import numpy as np

def batchnorm_cnn(x, gamma, beta, epsilon=1e-8):
    # x has shape (B, H, W, C); gamma and beta have shape (C,).

    # Calculate the mean and variance for each channel,
    # over the batch and both spatial axes.
    mean = np.mean(x, axis=(0, 1, 2), keepdims=True)  # shape (1, 1, 1, C)
    var = np.var(x, axis=(0, 1, 2), keepdims=True)  # shape (1, 1, 1, C)

    # Normalize the input tensor.
    x_hat = (x - mean) / np.sqrt(var + epsilon)

    # Scale and shift the normalized tensor, once per channel.
    y = gamma * x_hat + beta

    return y
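
The statement that all $BHW$ outputs of a kernel are treated as data points can be checked directly: per-channel BatchNorm on a tensor of shape (B, H, W, C) should match ordinary BatchNorm applied to the same values reshaped to (B·H·W, C). The sketch below assumes a channels-last layout and small, arbitrary tensor sizes; the helper name `batchnorm_flat` is illustrative.

```python
import numpy as np

def batchnorm_flat(x, gamma, beta, epsilon=1e-8):
    # Ordinary BatchNorm on a (rows, C) array: statistics over axis 0.
    mu = x.mean(axis=0)
    sigma2 = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(sigma2 + epsilon) + beta

rng = np.random.default_rng(0)
B, H, W, C = 4, 6, 6, 3  # assumed sizes, channels-last layout
x = rng.normal(size=(B, H, W, C))
gamma = rng.normal(size=(C,))
beta = rng.normal(size=(C,))

# Per-channel statistics over the batch and spatial axes.
mean = x.mean(axis=(0, 1, 2), keepdims=True)
var = x.var(axis=(0, 1, 2), keepdims=True)
y_cnn = gamma * (x - mean) / np.sqrt(var + 1e-8) + beta

# Same computation, viewing every spatial position as a separate data point.
y_flat = batchnorm_flat(x.reshape(-1, C), gamma, beta).reshape(B, H, W, C)

same = np.allclose(y_cnn, y_flat)
```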

Layer normalization

Layer normalization (LayerNorm)^[9] is a common alternative to BatchNorm. Unlike BatchNorm, which normalizes activations across the batch dimension for a given feature, LayerNorm normalizes across all the features within a single data sample. Compared to BatchNorm, LayerNorm's performance is not affected by batch size. It is a key component of Transformers.

For a given data input and layer, LayerNorm computes the mean ( $\mu$ ) and variance ( $\sigma ^{2}$ ) over all the neurons in the layer. Similar to BatchNorm, learnable parameters $\gamma$ (scale) and $\beta$ (shift) are applied. It is defined by: ${\hat {x_{i}}}={\frac {x_{i}-\mu }{\sqrt {\sigma ^{2}+\epsilon }}},\quad y_{i}=\gamma _{i}{\hat {x_{i}}}+\beta _{i}$ where $\mu ={\frac {1}{D}}\sum _{i=1}^{D}x_{i}$ and $\sigma ^{2}={\frac {1}{D}}\sum _{i=1}^{D}(x_{i}-\mu )^{2}$ , and $i$ ranges over the neurons in that layer.
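
The definition above can be sketched in NumPy as follows. This is a minimal illustration, assuming the features lie on the last axis of the input and that `gamma` and `beta` have shape (D,); the function name and the default `epsilon` are choices made here for illustration.

```python
import numpy as np

def layernorm(x, gamma, beta, epsilon=1e-8):
    # x has shape (..., D). Statistics are computed over the last axis,
    # separately for every sample, so the batch plays no role.
    mu = x.mean(axis=-1, keepdims=True)
    sigma2 = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(sigma2 + epsilon)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
D = 16
x = rng.normal(size=(4, D))  # 4 samples, D features each (assumed sizes)
y = layernorm(x, np.ones(D), np.zeros(D))

# Normalizing the first sample alone gives the same result as
# normalizing it as part of the batch.
y_single = layernorm(x[:1], np.ones(D), np.zeros(D))
```

Because the statistics are taken over the last axis only, the same function applies unchanged to an array of shape (T, D) holding one hidden vector per timestep.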

Examples

For example, in CNN, a LayerNorm applies to all activations in a layer. In the previous notation, we have ${\begin{aligned}\mu ^{(l)}&={\frac {1}{HWC}}\sum _{h=1}^{H}\sum _{w=1}^{W}\sum _{c=1}^{C}x_{h,w,c}^{(l)}\\(\sigma ^{(l)})^{2}&={\frac {1}{HWC}}\sum _{h=1}^{H}\sum _{w=1}^{W}\sum _{c=1}^{C}(x_{h,w,c}^{(l)}-\mu ^{(l)})^{2}\\{\hat {x}}_{h,w,c}^{(l)}&={\frac {x_{h,w,c}^{(l)}-\mu ^{(l)}}{\sqrt {(\sigma ^{(l)})^{2}+\epsilon }}}\\y_{h,w,c}^{(l)}&=\gamma ^{(l)}{\hat {x}}_{h,w,c}^{(l)}+\beta ^{(l)}\end{aligned}}$ Notice that, compared with BatchNorm for CNNs, the batch index $b$ is removed from the sums, while the channel index $c$ is added.
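
A minimal NumPy sketch of LayerNorm for a CNN layer, assuming a channels-last tensor of shape (B, H, W, C) and scalar `gamma` and `beta` as in the formula above; the function name is illustrative.

```python
import numpy as np

def layernorm_cnn(x, gamma, beta, epsilon=1e-8):
    # x has shape (B, H, W, C). For each sample, the mean and variance
    # are taken over height, width, and channels, never over the batch.
    mean = x.mean(axis=(1, 2, 3), keepdims=True)  # shape (B, 1, 1, 1)
    var = x.var(axis=(1, 2, 3), keepdims=True)    # shape (B, 1, 1, 1)
    x_hat = (x - mean) / np.sqrt(var + epsilon)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 4, 3))  # assumed sizes
y = layernorm_cnn(x, gamma=1.0, beta=0.0)
```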

In recurrent neural networks^[9] and Transformers,^[10] LayerNorm is applied individually to each timestep.

For example, if the hidden vector in an RNN at timestep $t$ is $x^{(t)}\in \mathbb {R} ^{D}$ where $D$ is the dimension of the hidden vector, then LayerNorm will be applied with ${\hat {x_{i}}}^{(t)}={\frac {x_{i}^{(t)}-\mu ^{(t)}}{\sqrt {(\sigma ^{(t)})^{2}+\epsilon }}},\quad y_{i}^{(t)}=\gamma _{i}{\hat {x_{i}}}^{(t)}+\beta _{i}$ where $\mu ^{(t)}={\frac {1}{D}}\sum _{i=1}^{D}x_{i}^{(t)}$ and $(\sigma ^{(t)})^{2}={\frac {1}{D}}\sum _{i=1}^{D}(x_{i}^{(t)}-\mu ^{(t)})^{2}$ .

Root mean square layer normalization

Root mean square layer normalization (RMSNorm)^[11] changes LayerNorm to ${\hat {x_{i}}}={\frac {x_{i}}{\sqrt {{\frac {1}{D}}\sum _{j=1}^{D}x_{j}^{2}}}},\quad y_{i}=\gamma {\hat {x_{i}}}+\beta$ Essentially, it is LayerNorm where we enforce $\mu =0$ and $\epsilon =0$ .
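
A minimal NumPy sketch of the formula above, assuming the features lie on the last axis. Following the text, no $\epsilon$ is added, so an all-zero input vector would cause a division by zero; practical implementations commonly add a small constant, which is omitted here.

```python
import numpy as np

def rmsnorm(x, gamma, beta=0.0):
    # Divide by the root mean square of the features; no mean subtraction.
    # No epsilon is used, matching the formula in the text (assumes x != 0).
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True))
    return gamma * x / rms + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))  # 4 samples, 16 features (assumed sizes)
y = rmsnorm(x, gamma=1.0)

# With gamma = 1 and beta = 0, each output vector has unit mean square.
mean_square = np.mean(y ** 2, axis=-1)
```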

Other normalizations

Weight normalization (WeightNorm)^[12] is a technique inspired by BatchNorm. It normalizes weight matrices in a neural network, rather than its neural activations.
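
As a rough illustration, WeightNorm reparameterizes each weight vector as $w=g\,v/\|v\|$ , separating its length $g$ from its direction. The sketch below assumes a weight matrix of shape (out, in) with one scale per output row; that layout and the names `v` and `g` are conventions chosen here for illustration.

```python
import numpy as np

def weightnorm(v, g):
    # v: unnormalized weights, shape (out, in); g: per-row scale, shape (out,).
    # Each row of the result has direction v_row / ||v_row|| and length g_row.
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    return g[:, None] * v / norms

rng = np.random.default_rng(0)
v = rng.normal(size=(3, 5))    # hypothetical direction parameters
g = np.array([0.5, 1.0, 2.0])  # hypothetical length parameters
w = weightnorm(v, g)

row_norms = np.linalg.norm(w, axis=1)
```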

Gradient normalization (GradNorm)^[13] normalizes gradient vectors during backpropagation.

CNN-specific normalization

There are some activation normalization techniques that are only used for CNNs.

Group normalization

Group normalization (GroupNorm)^[14] is a technique only used for CNNs. It can be understood as the LayerNorm for CNN applied once per channel-group.

Suppose at a layer $l$ , there are channels $1,2,\dots ,C$ . We partition them into groups $g_{1},\dots ,g_{G}$ , and then apply LayerNorm to each group separately.
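
A minimal NumPy sketch of this procedure, under several assumptions not stated in the text: a channels-last tensor of shape (B, H, W, C), groups formed from contiguous, equally sized blocks of channels, and per-channel `gamma` and `beta` of shape (C,).

```python
import numpy as np

def groupnorm(x, num_groups, gamma, beta, epsilon=1e-8):
    # x has shape (B, H, W, C); C must be divisible by num_groups.
    B, H, W, C = x.shape
    assert C % num_groups == 0
    # Split the channel axis into (group, channel-within-group).
    xg = x.reshape(B, H, W, num_groups, C // num_groups)
    # Statistics per sample and per group: over height, width,
    # and the channels inside the group.
    mean = xg.mean(axis=(1, 2, 4), keepdims=True)
    var = xg.var(axis=(1, 2, 4), keepdims=True)
    x_hat = ((xg - mean) / np.sqrt(var + epsilon)).reshape(B, H, W, C)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 4, 6))  # assumed sizes
y = groupnorm(x, num_groups=3, gamma=np.ones(6), beta=np.zeros(6))

# The first group consists of channels 0 and 1.
first_group_mean = y[..., 0:2].mean(axis=(1, 2, 3))
```

With `num_groups=1` this reduces to LayerNorm over all of (H, W, C), and with `num_groups=C` each channel is normalized on its own.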

Instance normalization

Instance normalization (InstanceNorm), or contrast normalization, is a technique first developed for neural style transfer, and is only used for CNNs.^[15] It can be understood as the LayerNorm for CNN applied once per channel, or equivalently, as group normalization where each group consists of a single channel: ${\begin{aligned}\mu _{c}^{(l)}&={\frac {1}{HW}}\sum _{h=1}^{H}\sum _{w=1}^{W}x_{h,w,c}^{(l)}\\(\sigma _{c}^{(l)})^{2}&={\frac {1}{HW}}\sum _{h=1}^{H}\sum _{w=1}^{W}(x_{h,w,c}^{(l)}-\mu _{c}^{(l)})^{2}\\{\hat {x}}_{h,w,c}^{(l)}&={\frac {x_{h,w,c}^{(l)}-\mu _{c}^{(l)}}{\sqrt {(\sigma _{c}^{(l)})^{2}+\epsilon }}}\\y_{h,w,c}^{(l)}&=\gamma _{c}^{(l)}{\hat {x}}_{h,w,c}^{(l)}+\beta _{c}^{(l)}\end{aligned}}$
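
A minimal NumPy sketch of these formulas, assuming a channels-last tensor of shape (B, H, W, C) and per-channel `gamma` and `beta` of shape (C,); the function name is illustrative.

```python
import numpy as np

def instancenorm(x, gamma, beta, epsilon=1e-8):
    # x has shape (B, H, W, C). Statistics are computed over the spatial
    # axes only, separately for every sample and every channel.
    mean = x.mean(axis=(1, 2), keepdims=True)  # shape (B, 1, 1, C)
    var = x.var(axis=(1, 2), keepdims=True)    # shape (B, 1, 1, C)
    x_hat = (x - mean) / np.sqrt(var + epsilon)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 8, 3))  # assumed sizes
y = instancenorm(x, gamma=np.ones(3), beta=np.zeros(3))
```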

Adaptive instance normalization

Adaptive instance normalization (AdaIN) is a variant of instance normalization, designed specifically for neural style transfer with CNN, not for CNN in general.^[16]

In the AdaIN method of style transfer, we take a CNN and two input images, one content and one style. Each image is processed through the same CNN, and at a certain layer $l$ , AdaIN is applied.

Let $x^{(l),{\text{ content}}}$ be the activation in the content image, and $x^{(l),{\text{ style}}}$ be the activation in the style image. Then, AdaIN first computes the per-channel mean and standard deviation of the activations of the style image $x^{(l),{\text{ style}}}$ , then uses the standard deviation as $\gamma$ and the mean as $\beta$ for InstanceNorm on $x^{(l),{\text{ content}}}$ . Note that $x^{(l),{\text{ style}}}$ itself remains unchanged. Explicitly, we have ${\begin{aligned}y_{h,w,c}^{(l),{\text{ content}}}&=\sigma _{c}^{(l),{\text{ style}}}\left({\frac {x_{h,w,c}^{(l),{\text{ content}}}-\mu _{c}^{(l),{\text{ content}}}}{\sqrt {(\sigma _{c}^{(l),{\text{ content}}})^{2}+\epsilon }}}\right)+\mu _{c}^{(l),{\text{ style}}}.\end{aligned}}$
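
The formula above can be sketched in NumPy for a single pair of feature maps. This is an illustration only, assuming channels-last activations of shape (H, W, C) taken from the same layer; the two images may have different spatial sizes, and the function name is a choice made here.

```python
import numpy as np

def adain(content, style, epsilon=1e-8):
    # content: (H, W, C) and style: (H', W', C) activations at the same layer.
    # Per-channel statistics over the spatial axes of each image.
    mu_c = content.mean(axis=(0, 1), keepdims=True)
    var_c = content.var(axis=(0, 1), keepdims=True)
    mu_s = style.mean(axis=(0, 1), keepdims=True)
    sigma_s = style.std(axis=(0, 1), keepdims=True)
    # Normalize the content activations, then rescale and shift them
    # with the style statistics. The style activations are not modified.
    return sigma_s * (content - mu_c) / np.sqrt(var_c + epsilon) + mu_s

rng = np.random.default_rng(0)
content = rng.normal(loc=1.0, scale=2.0, size=(8, 8, 3))    # assumed sizes
style = rng.normal(loc=-3.0, scale=0.5, size=(10, 10, 3))
out = adain(content, style)
```

After the transform, each channel of the output has approximately the mean and standard deviation of the corresponding style channel.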

References

[1] Huang, Lei (2022). Normalization Techniques in Deep Learning. Synthesis Lectures on Computer Vision. Cham: Springer International Publishing. doi:10.1007/978-3-031-14595-7. ISBN 978-3-031-14594-0.
[2] Ioffe, Sergey; Szegedy, Christian (2015-06-01). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". Proceedings of the 32nd International Conference on Machine Learning. PMLR: 448–456. arXiv:1502.03167.
[3] Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). "8.7.1. Batch Normalization". Deep learning. Adaptive computation and machine learning. Cambridge, Massachusetts: The MIT Press. ISBN 978-0-262-03561-3.
[4] Desjardins, Guillaume; Simonyan, Karen; Pascanu, Razvan; Kavukcuoglu, Koray (2015). "Natural Neural Networks". Advances in Neural Information Processing Systems. 28. Curran Associates, Inc.
[5] Xu, Jingjing; Sun, Xu; Zhang, Zhiyuan; Zhao, Guangxiang; Lin, Junyang (2019). "Understanding and Improving Layer Normalization". Advances in Neural Information Processing Systems. 32. Curran Associates, Inc. arXiv:1911.07013.
[6] Awais, Muhammad; Bin Iqbal, Md. Tauhid; Bae, Sung-Ho (November 2021). "Revisiting Internal Covariate Shift for Batch Normalization". IEEE Transactions on Neural Networks and Learning Systems. 32 (11): 5082–5092. doi:10.1109/TNNLS.2020.3026784. ISSN 2162-237X. PMID 33095717.
[7] Bjorck, Nils; Gomes, Carla P; Selman, Bart; Weinberger, Kilian Q (2018). "Understanding Batch Normalization". Advances in Neural Information Processing Systems. 31. Curran Associates, Inc. arXiv:1806.02375.
[8] Santurkar, Shibani; Tsipras, Dimitris; Ilyas, Andrew; Madry, Aleksander (2018). "How Does Batch Normalization Help Optimization?". Advances in Neural Information Processing Systems. 31. Curran Associates, Inc.
[9] Ba, Jimmy Lei; Kiros, Jamie Ryan; Hinton, Geoffrey E. (2016). "Layer Normalization". arXiv:1607.06450.
[10] Phuong, Mary; Hutter, Marcus (2022-07-19), Formal Algorithms for Transformers, doi:10.48550/arXiv.2207.09238, retrieved 2024-08-08
[11] Zhang, Biao; Sennrich, Rico (2019-10-16), Root Mean Square Layer Normalization, doi:10.48550/arXiv.1910.07467, retrieved 2024-08-07
[12] Salimans, Tim; Kingma, Diederik P. (2016-06-03), Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, doi:10.48550/arXiv.1602.07868, retrieved 2024-08-08
[13] Chen, Zhao; Badrinarayanan, Vijay; Lee, Chen-Yu; Rabinovich, Andrew (2018-07-03). "GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks". Proceedings of the 35th International Conference on Machine Learning. PMLR: 794–803.
[14] Wu, Yuxin; He, Kaiming (2018). "Group Normalization": 3–19.
[15] Ulyanov, Dmitry; Vedaldi, Andrea; Lempitsky, Victor (2017-11-06), Instance Normalization: The Missing Ingredient for Fast Stylization, doi:10.48550/arXiv.1607.08022, retrieved 2024-08-08
[16] Huang, Xun; Belongie, Serge (2017). "Arbitrary Style Transfer in Real-Time With Adaptive Instance Normalization": 1501–1510.
