In this blog post, we will go over the ICLR 2022 paper titled NETWORK AUGMENTATION FOR TINY DEEP LEARNING. This paper introduces a new training method for improving the performance of tiny neural networks. NetAug augments the network (the reverse of dropout): it puts the tiny model inside larger models and encourages it to work as a sub-model of those larger models so that it receives extra supervision.
Network Augmentation aka NetAug
NetAug focuses solely on improving the performance of tiny neural networks at inference time while keeping their memory footprint small enough for deployment on edge devices. Tiny neural networks usually tend to underfit. Traditional training paradigms therefore do not work well for these small models, because they fundamentally tackle overfitting rather than underfitting. Several techniques proposed in recent years, such as data augmentation, pruning, dropout, and knowledge distillation, aim to improve the generalizability of neural networks.
The end goal of any ML model is to minimize its loss function via gradient descent. Since a tiny neural network has very little capacity, gradient descent is likely to get stuck in a poor local minimum. Instead of traditional regularization techniques that add noise to the data or the model, NetAug increases the capacity available to the tiny model without changing its architecture, so that deployment and inference on edge devices remain efficient. This is done by augmenting the tiny model (referred to as the base model) into a larger model and jointly training the base model, both independently and as part of the augmented model, so that the base model benefits from the extra supervision it receives from the augmented model. During inference, only the base model is used.
To speed up training, a single largest augmented model is constructed by augmenting the width of each layer of the base model by an augmentation factor \(r\). After building the largest augmented model, other augmented models are constructed by selecting a subset of channels from the largest augmented model. NetAug introduces a hyper-parameter \(s\), named the diversity factor, to control the number of augmented model configurations. We set the augmented widths to be linearly spaced between \(w\) and \(r \times w\). For instance, with \(r = 3\) and \(s = 2\), the possible widths would be \([w, 2w, 3w]\).
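As a concrete illustration, here is a minimal Python sketch (our own, not the paper's code) of how the candidate widths for one layer could be enumerated from \(w\), \(r\), and \(s\) under the linear-spacing rule described above.

```python
# Enumerate candidate layer widths for NetAug-style augmentation.
# Assumption: widths are linearly spaced between w and r*w, with s extra
# augmented configurations beyond the base width w.
def augmented_widths(w: int, r: int, s: int) -> list[int]:
    """Return the candidate widths for one layer of the augmented models."""
    step = (r * w - w) / s
    return [round(w + i * step) for i in range(s + 1)]

print(augmented_widths(w=8, r=3, s=2))  # [8, 16, 24], i.e. [w, 2w, 3w]
```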
Knowledge distillation transfers knowledge from a (typically larger) teacher model to a smaller student model.
We specifically aim to train the teacher and student models in an online setting: in online distillation, both the teacher and the student are trained together simultaneously.
Relational Knowledge Distillation (RKD) can be expressed as follows.
Given a teacher model \(T\) and a student model \(S\), let \(f_T(\cdot)\) and \(f_S(\cdot)\) denote the functions of the teacher and the student, respectively, and let \(\psi\) be a function extracting the relation. We then have
\[\begin{equation}\mathcal{L}_{\text{RKD}} = \sum\limits_{(x_1, \cdots, x_n) \in \chi^n} l\big(\psi(t_1, \cdots, t_n), \psi(s_1, \cdots, s_n)\big)\end{equation}\]
where \(\mathcal{L}_{\text{RKD}}\) is the loss function, \(l\) penalizes the difference between the teacher and student relations, \(t_i = f_T(x_i)\), \(s_i = f_S(x_i)\), and \(x_i \in \chi\) denotes the input data.
The first method is known as RKD-D. It transfers the relative distances between points in the embedding space. Mathematically,
\[\begin{equation}\psi_{D}(t_i, t_j) = \frac{1}{\mu}\big\| t_i - t_j\big\|_2\end{equation}\]
where \(\psi_D(\cdot, \cdot)\) denotes the distance-wise potential function and
\[\begin{equation}\mu = \frac{1}{|\chi^2|}\sum\limits_{(x_i, x_j) \in \chi^2} \big\| t_i - t_j\big\|_2\end{equation}\]
\[\begin{equation}\boxed{\mathcal{L}_{\text{RKD-D}} = \sum \limits_{(x_i, x_j) \in \chi^2} l_\delta \big(\psi_D(t_i, t_j), \psi_D(s_i, s_j)\big)}\end{equation}\]
where \(l_{\delta}\) denotes the Huber loss
\[\begin{equation}l_\delta(x, y) = \begin{cases} \frac{1}{2} (x-y)^2 & \text{for } |x-y| \leq 1 \\ |x - y| - \frac{1}{2} & \text{otherwise.} \end{cases}\end{equation}\]The second method is known as RKD-A. It transfers the angles formed by triplets of points in the embedding space. Mathematically,
\[\begin{equation}\psi_{A}(t_i, t_j, t_k) = \cos \angle t_it_jt_k = \langle \boldsymbol{e}^{ij}, \boldsymbol{e}^{jk}\rangle\end{equation}\]
where \(\psi_A(\cdot, \cdot, \cdot)\) denotes the angle-wise potential function and
\[\begin{equation}\boldsymbol{e}^{ij} = \frac{t_i - t_j}{\big\|t_i - t_j\big\|_2}, \: \boldsymbol{e}^{jk} = \frac{t_k - t_j}{\big\|t_k - t_j\big\|_2}\end{equation}\]
\[\begin{equation}\boxed{\mathcal{L}_{\text{RKD-A}} = \sum\limits_{(x_i, x_j, x_k) \in \chi^3} l_\delta \big(\psi_A(t_i, t_j, t_k), \psi_A(s_i, s_j, s_k)\big)}\end{equation}\]
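Below is a minimal PyTorch sketch (our illustration, not the RKD reference implementation) of the distance-wise and angle-wise losses defined above; `teacher_emb` and `student_emb` are assumed to be `(batch, dim)` embeddings from the teacher and the student.

```python
# Sketch of RKD-D and RKD-A losses; averages over pairs/triplets rather than summing.
import torch
import torch.nn.functional as F

def rkd_distance_loss(teacher_emb, student_emb):
    # Pairwise Euclidean distances, normalized by their mean (mu) as in psi_D.
    t_dist = torch.cdist(teacher_emb, teacher_emb, p=2)
    s_dist = torch.cdist(student_emb, student_emb, p=2)
    t_dist = t_dist / t_dist[t_dist > 0].mean()
    s_dist = s_dist / s_dist[s_dist > 0].mean()
    # Huber loss (delta = 1) between teacher and student distance potentials.
    return F.smooth_l1_loss(s_dist, t_dist)

def rkd_angle_loss(teacher_emb, student_emb):
    # Angle-wise potentials: cosine of the angle at t_j formed by (t_i, t_j, t_k).
    def angles(e):
        diff = e.unsqueeze(1) - e.unsqueeze(0)           # diff[i, j] = e_i - e_j
        diff = F.normalize(diff, p=2, dim=2)             # unit direction vectors
        return torch.einsum('ijd,kjd->ijk', diff, diff)  # <e^{ij}, e^{jk}>
    return F.smooth_l1_loss(angles(student_emb), angles(teacher_emb))
```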
To address the shortcomings of NetAug's naive loss, we propose the following loss function:
\[\begin{equation}\mathcal{L}_{\text{aug}} = \underbrace{\mathcal{L}(W_t)}_{\text{base supervision}} \:+\: \underbrace{\alpha_1 \mathcal{L}([W_t, W_1]) + \cdots + \alpha_i \mathcal{L}([W_t, W_i]) + \cdots}_{\text{auxiliary supervision, working as a sub-model of augmented models}} \:+\: \lambda_{\text{KD}}\,\underbrace{\mathbf{\mathcal{L}_{\text{RKD}}}}_{\text{relational knowledge distillation}}\label{eqn:loss_func}\end{equation}\]where \([W_t, W_i]\) represents an augmented model, \(W_t\) the weights of the tiny (base) neural network, and \(W_i\) the weights of the sub-model sampled from the largest augmented model; \(\alpha_i\) is a scaling hyper-parameter for combining the losses from different augmented models, and \(\lambda_{\text{KD}}\) is a tunable hyper-parameter balancing RKD and NetAug supervision.
In NetAug, only one augmented model is trained in each epoch; training all the augmented models at once is not only computationally expensive but also hurts performance. The proportion of base supervision decreases when more augmented networks are sampled, which biases the training process toward the augmented networks and overshadows the base model.
To further enhance the auxiliary supervision, we propose to use RKD in an online setting: the largest augmented model acts as the teacher and the base model acts as the student, and both are trained simultaneously.
We train both the base model and the augmented models via gradient descent based on the loss function \(\eqref{eqn:loss_func}\). The gradient update for the base model is then given by
\[\begin{equation}W^{n+1}_t = W^n_t - \eta \bigg(\frac{\partial \mathcal{L}(W^n_t)}{\partial W^n_t} + \alpha \frac{\partial \mathcal{L}([W^n_t, W^n_i])}{\partial W^n_t} + \lambda \frac{\partial \mathcal{L}_{\text{RKD}}([W^n_t, W^n_l])}{\partial W^n_t}\bigg)\end{equation}\]Similar update equations can be obtained for the largest augmented model and the sub-models as well.
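The following PyTorch-style sketch shows how one training step under this loss could look. It is only an outline under our assumptions: `sample_augmented_config` and `forward_as_submodel` are hypothetical helpers standing in for NetAug's sampling and sub-model forwarding, and the models are assumed to return both logits and embeddings.

```python
# Hypothetical training step: base supervision + one sampled augmented model + online RKD.
import torch

def train_step(base_model, largest_aug_model, batch, labels,
               criterion, optimizer, alpha=1.0, lambda_kd=1.0):
    optimizer.zero_grad()

    # Base supervision: the tiny model on its own.
    base_logits, base_emb = base_model(batch)
    loss = criterion(base_logits, labels)

    # Auxiliary supervision: the base model working as a sub-model of one
    # sampled augmented model (only one augmented model per epoch).
    aug_config = sample_augmented_config()                      # assumed helper
    aug_logits, _ = forward_as_submodel(largest_aug_model,      # assumed helper
                                        base_model, aug_config, batch)
    loss = loss + alpha * criterion(aug_logits, labels)

    # Online RKD: largest augmented model as teacher, base model as student;
    # both are updated by the same backward pass.
    teacher_logits, teacher_emb = largest_aug_model(batch)
    loss = loss + lambda_kd * (rkd_distance_loss(teacher_emb, base_emb)
                               + rkd_angle_loss(teacher_emb, base_emb))

    loss.backward()
    optimizer.step()
    return loss.item()
```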
We propose this method to address the difficulty of training the generated augmented models. It is based on re-parametrizing the weights into a shared stem-direction and per-module branch-directions, as formulated below.
Mathematical Formulation
Denote the neural model as consisting of \(L\) stacked, structurally similar modules \(\mathcal{M} = \{\mathcal{M}_i\}, \, i=1,\cdots, L\), and let \(w = \{w_i\}, \, i=1,\cdots, L\) denote the corresponding weights. These weights are re-parametrized as
\[\begin{equation}w_i = \frac{1}{\sqrt{L}}w_0 + \tilde{w}_i, \:\:\:\: i=1,\cdots,L\end{equation}\]Here \(w_0\) represents the weights shared across all modules and is referred to as the stem-direction, while \(\tilde{w}_i\) represents the unshared weights of each module, referred to as the branch-directions.
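A minimal PyTorch sketch (our illustration, assuming simple linear modules) of this re-parametrization:

```python
# Stem/branch re-parametrization: w_i = w_0 / sqrt(L) + w_tilde_i.
import math
import torch
import torch.nn as nn

class SharedStemLinear(nn.Module):
    def __init__(self, num_modules: int, dim: int):
        super().__init__()
        self.L = num_modules
        # w_0: stem-direction shared across all modules.
        self.stem = nn.Parameter(torch.randn(dim, dim) * 0.02)
        # w_tilde_i: per-module branch-directions.
        self.branches = nn.ParameterList(
            [nn.Parameter(torch.zeros(dim, dim)) for _ in range(num_modules)]
        )

    def weight(self, i: int) -> torch.Tensor:
        # Effective weight of module i.
        return self.stem / math.sqrt(self.L) + self.branches[i]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i in range(self.L):
            x = torch.relu(x @ self.weight(i))
        return x
```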
Training Strategy
Denote \(T\) as the number of training steps, \(\eta\) as the step size, and \(\alpha \in (0, 1)\) as a tunable hyper-parameter indicating the fraction of weight-sharing steps used when training \(\mathcal{M}\).
The effective gradients for \(w_0\) and \(\tilde{w}_i\) can be found using the chain rule as follows:
\[\begin{equation}g_0 = \frac{\partial \mathcal{L}}{\partial w_0} = \sum\limits_{i=1}^L \frac{\partial \mathcal{L}}{\partial w_i}\,\frac{\partial w_i}{\partial w_0} = \frac{1}{\sqrt{L}}\sum\limits_{i=1}^L g_i \end{equation}\] \[\begin{equation}\tilde{g}_i = \frac{\partial \mathcal{L}}{\partial \tilde{w}_i} = \frac{\partial \mathcal{L}}{\partial w_i}\,\frac{\partial w_i}{\partial \tilde{w}_i} = g_i\end{equation}\]where \(g_i\) denotes the gradient of \(\mathcal{L}\) with respect to \(w_i\) and \(\mathcal{L}\) denotes the loss function.
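As a sanity check, the snippet below (reusing the `SharedStemLinear` sketch above) verifies with autograd that the stem gradient equals \(\frac{1}{\sqrt{L}}\sum_i g_i\), exactly as derived:

```python
# Autograd check of the chain-rule result for the stem/branch re-parametrization.
import math
import torch

x = torch.randn(4, 16)
model = SharedStemLinear(num_modules=3, dim=16)   # from the sketch above
loss = model(x).pow(2).mean()
loss.backward()

g_branches = torch.stack([b.grad for b in model.branches])     # g_tilde_i = g_i
expected_stem_grad = g_branches.sum(dim=0) / math.sqrt(model.L)
print(torch.allclose(model.stem.grad, expected_stem_grad, atol=1e-6))  # True
```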
We propose this method to solve the problem of generating the largest augmented model. In NetAug, the largest augmented model is generated randomly based solely on the hyper-parameters \(r\) and \(s\); instead, we adopt Single Path One-Shot NAS with uniform sampling.
Supernet Weight Optimization: \(\begin{equation}W_{\mathcal{A}}=\mathop{\arg \min}_{W} \: \mathcal{L}_{\text{train}}\big(\mathcal{N}(\mathcal{A},W)\big)\end{equation}\)
Here \(\mathcal{A}\) is the architecture search space, represented as a directed acyclic graph and encoded as a supernet \(\mathcal{N}(\mathcal{A}, W)\). During each SGD step in the above equation, edges in the supernet graph are randomly dropped according to a dropout-rate parameter. In this way, the co-adaptation of the node weights is reduced during training, making the supernet easier to train.
Architecture Search Optimization: \(\begin{equation}a^* = \mathop{\arg \max}_{a \in \mathcal{A}} \text{ ACC}_{\text{val}}\bigg(\mathcal{N}\big(a,W_{\mathcal{A}}(a)\big)\bigg)\end{equation}\)
During the search, each sampled architecture \(a\) inherits its weights from \(W_{\mathcal{A}}\) as \(W_{\mathcal{A}}(a)\). The architecture weights are ready to use, making the search very efficient and flexible. This type of sequential optimization works because the accuracy of any architecture \(a\) on a validation set using the inherited weights \(W_{\mathcal{A}}(a)\) (without extra fine-tuning) is highly predictive of the accuracy of \(a\) when fully trained. Hence, to reduce the training loss further, the supernet weights \(W_{\mathcal{A}}\) are optimized such that all architectures in the search space are optimized simultaneously:
\[\begin{equation}W_{\mathcal{A}} = \mathop{\arg \min}_{W} \: \mathbb{E}_{a \sim \Gamma(\mathcal{A})}\Big[\mathcal{L}_{\text{train}}\big(\mathcal{N}\big(a, W(a)\big)\big)\Big]\end{equation}\]
where \(\Gamma(\mathcal{A})\) is the prior distribution of \(a \in \mathcal{A}\). This stochastic training of the supernet helps the optimized model generalize better and is also computationally efficient. To overcome the problem of weight coupling, the supernet \(\mathcal{N}(\mathcal{A},W)\) is chosen such that each sampled architecture is a single path, which makes this realization hyper-parameter free compared to traditional NAS approaches. The distribution \(\Gamma(\mathcal{A})\) is fixed a priori as a uniform distribution during training and is not a learnable parameter.
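The sketch below (our schematic, not the SPOS reference code) illustrates uniform single-path sampling: each layer of a toy supernet holds several candidate blocks, and one block per layer is drawn uniformly at random at each training step, so only that path's weights receive gradients.

```python
# Toy supernet with uniform single-path sampling (Gamma(A) fixed as uniform).
import random
import torch
import torch.nn as nn

class SinglePathSupernet(nn.Module):
    def __init__(self, dim: int, choices_per_layer: int, num_layers: int):
        super().__init__()
        # Each layer holds several candidate (choice) blocks.
        self.layers = nn.ModuleList(
            nn.ModuleList(nn.Linear(dim, dim) for _ in range(choices_per_layer))
            for _ in range(num_layers)
        )

    def sample_path(self):
        # Gamma(A): uniform prior over choices, fixed a priori and not learned.
        return [random.randrange(len(layer)) for layer in self.layers]

    def forward(self, x, path):
        for layer, choice in zip(self.layers, path):
            x = torch.relu(layer[choice](x))
        return x

supernet = SinglePathSupernet(dim=16, choices_per_layer=4, num_layers=3)
path = supernet.sample_path()                 # e.g. [2, 0, 3]
out = supernet(torch.randn(8, 16), path)      # only this path's blocks get gradients
```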
Evaluation metrics are at the core of building any accurate machine learning model. We propose to implement precision@\(k\), recall@\(k\), and f1-score@\(k\) for the augmented sub-models sampled from the largest augmented model and for the base model itself, where the metrics are computed over the top-\(k\) predictions on the test set. These metrics assess the training procedure better than vanilla accuracy: since we are tackling underfitting, an underfitted model is likely to make the same prediction for every input, and in the presence of class imbalance plain accuracy would give misleading results.
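A minimal NumPy sketch of these metrics under our interpretation (a class counts as predicted for a sample if it appears among that sample's top-\(k\) scores):

```python
# Per-class precision@k, recall@k and f1-score@k over top-k predictions.
import numpy as np

def topk_metrics(logits: np.ndarray, labels: np.ndarray, k: int, num_classes: int):
    """logits: (N, C) scores, labels: (N,) integer targets."""
    topk = np.argsort(-logits, axis=1)[:, :k]          # (N, k) predicted classes
    precision, recall, f1 = {}, {}, {}
    for c in range(num_classes):
        predicted_c = (topk == c).any(axis=1)          # c appears in the top-k
        actual_c = labels == c
        tp = np.sum(predicted_c & actual_c)
        precision[c] = tp / max(predicted_c.sum(), 1)
        recall[c] = tp / max(actual_c.sum(), 1)
        f1[c] = 2 * precision[c] * recall[c] / max(precision[c] + recall[c], 1e-12)
    return precision, recall, f1
```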
The main takeaway of this blog post is that tiny neural networks can be refined further, without any loss in accuracy, to prevent underfitting. The paper achieves this in a unique way, distinct from conventional techniques such as regularization, dropout, and data augmentation. This blog additionally discusses techniques that are orthogonal to NetAug and can be combined with it for improved results.