...in which we find a connection between meta-learning literature and a paper studying how well CNNs deal with nuisance transforms in a class-imbalanced setting. Closer inspection reveals a surprising amount of similarity - from meta-information to loss functions.
At the last ICLR conference, Zhou et al. [2022] presented the paper “Do Deep Networks Transfer Invariances Across Classes?”.
Here is a quick summary of their findings: If we train a Convolutional Neural Net (CNN) to classify fruit on a set of randomly brightened and darkened images of apples and oranges, it will learn to ignore the scene’s brightness. We say that the CNN learned that classification is invariant to the nuisance transformation of randomly changing the brightness of an image. We now add a set of plums to the training data, but fewer examples of them than we have apples and oranges. However, we keep using the same random transformations. The training set thus becomes class-imbalanced.
We might expect a sophisticated learner to look at the entire dataset, recognize the random brightness modifications across all types of fruit and henceforth ignore brightness when making predictions. If this applied to our fruit experiment, the CNN would be similarly good at ignoring lighting variations on all types of fruit. Furthermore, we would expect the CNN to become more competent at ignoring lighting variations in proportion to the total amount of images, irrespective of which fruit they depict.
Zhou et al. [2022] show that this is not what happens: the trained CNN’s invariance to the nuisance transformation is markedly weaker for the classes with few training examples.
However, there is a solution: Zhou et al. [2022] train a generative model to capture the nuisance transformations from the entire dataset and use it to transfer the corresponding invariance to the rare classes.
In this blog post, we are going to argue that the mechanism behind this solution is closely related to meta-learning, both in the kind of meta-information it extracts and in the structure of its loss functions and optimization.
Before we proceed to the main post, let’s clarify some definitions. If you are already familiar with the subject, you may skip this part:
In many real-world classification datasets, the number of examples for each class varies. Class-imbalanced classification refers to classification on datasets where the frequencies of class labels vary significantly.
It is generally more difficult for a neural network to learn to classify classes with fewer examples. However, it is often important to perform well on all classes, regardless of their frequency in the dataset. If we train a model to classify a dataset of different skin tumors, most examples may be benign. Still, it is crucial to identify the rare, malignant ones. Experiment design, including training and evaluation methods, must therefore be adjusted when using class-imbalanced data.
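As a small illustration of one such adjustment, here is a minimal sketch of evaluating accuracy per class and averaging over classes, so that rare classes weigh as much as frequent ones. The names `model`, `loader`, and `num_classes` are generic stand-ins, not tied to any particular paper.

```python
import torch

def balanced_accuracy(model, loader, num_classes, device="cpu"):
    """Average per-class accuracy: every class contributes equally,
    regardless of how many examples it has in the test set."""
    correct = torch.zeros(num_classes)
    total = torch.zeros(num_classes)
    model.eval()
    with torch.no_grad():
        for x, y in loader:
            pred = model(x.to(device)).argmax(dim=1).cpu()
            for c in range(num_classes):
                mask = y == c
                correct[c] += (pred[mask] == c).sum().item()
                total[c] += mask.sum().item()
    # clamp avoids division by zero for classes absent from the test set
    return (correct / total.clamp(min=1)).mean().item()
```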
Transformations are alterations of data. In the context of image classification, nuisance transformations are alterations that do not affect the class labels of the data. A model is said to be invariant to a nuisance transformation if it can successfully ignore the transformation when predicting a class label.
We can formally define a nuisance transformation $T(\cdot |x)$
as a distribution over transformation functions. An example of a nuisance transformation might be a distribution over rotation matrices of different angles, or lighting transformations with different exposure values. By definition, nuisance transformations have no impact on class labels $y$, only on data $x$. A perfectly transformation-invariant classifier would thus completely ignore them, i.e.,
$$ \hat{P}_w(y = j|x) = \hat{P}_w(y = j|x'), \; x' \sim T(\cdot |x). $$
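As a concrete illustration (our own toy example, not taken from the paper), a nuisance distribution over random rotations could be sampled like this:

```python
import random
import torchvision.transforms.functional as TF

def sample_nuisance_transform():
    """One draw from a nuisance distribution T(.|x): a rotation by a random angle.
    The returned function changes the image x into x' but never its label y."""
    angle = random.uniform(-180.0, 180.0)
    return lambda img: TF.rotate(img, angle)

# A perfectly invariant classifier would produce identical predictive
# distributions for img and transform(img), for every transform sampled this way.
transform = sample_nuisance_transform()
```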
Let’s take a more detailed look at the experiment Zhou et al. [2022] conducted. They train a CNN classifier on class-imbalanced image datasets to which a nuisance transformation, such as a random rotation or brightness change, is applied uniformly across all classes.
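To make the setup concrete, here is a rough sketch of how such a dataset could be assembled with torchvision. The folder layout, fruit classes, and sampling fractions are hypothetical illustrations, not the datasets used by Zhou et al. [2022].

```python
import random
from torch.utils.data import Subset
from torchvision import datasets, transforms

# The same nuisance transformation is applied to every image, regardless of
# class: here, a random brightness change that does not affect the label.
nuisance = transforms.Compose([
    transforms.ColorJitter(brightness=0.8),
    transforms.ToTensor(),
])

# Hypothetical folder layout: fruit/{apple,orange,plum}/*.jpg
dataset = datasets.ImageFolder("fruit", transform=nuisance)

# Make the training set class-imbalanced by subsampling one class.
keep_fraction = {"apple": 1.0, "orange": 1.0, "plum": 0.05}
keep = [
    i
    for i, (_, label) in enumerate(dataset.samples)
    if random.random() < keep_fraction[dataset.classes[label]]
]
imbalanced_dataset = Subset(dataset, keep)
```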
To measure the invariance of the trained model to the applied transformation, Zhou et al. [2022] compute the expected KL divergence (eKLD) between the model’s predictive distributions on original and transformed test images:
$$ \mathrm{eKLD}(\hat{P}_w) = \mathbb{E}_{x \sim \mathbb{P}_{test},\; x' \sim T(\cdot|x)} \left[ D_{KL}\left(\hat{P}_w(y|x) \,\|\, \hat{P}_w(y|x')\right) \right] $$
If the learner is invariant to the transformation, the predicted probability distribution over class labels should be identical for the transformed and untransformed images. In that case, the KLD is zero; otherwise, it is greater than zero. The higher the expected KL divergence, the more the applied transformation impacts the network’s predictions.
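A rough sketch of how this quantity could be estimated in practice (the function and its arguments are our own naming, not the paper’s code; `nuisance_transform` is a stochastic function $x \mapsto x'$):

```python
import torch
import torch.nn.functional as F

def expected_kld(model, loader, nuisance_transform, device="cpu"):
    """Monte-Carlo estimate of eKLD: the expected KL divergence between the
    model's predictive distributions on original and transformed test images."""
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for x, _ in loader:
            x = x.to(device)
            x_t = nuisance_transform(x)               # x' ~ T(.|x)
            log_p = F.log_softmax(model(x), dim=1)    # log P_w(y|x)
            log_q = F.log_softmax(model(x_t), dim=1)  # log P_w(y|x')
            kld = (log_p.exp() * (log_p - log_q)).sum(dim=1)  # KL(P(y|x) || P(y|x'))
            total += kld.sum().item()
            n += x.size(0)
    return total / n
```

Evaluating this separately for the examples of each class then lets us compare invariance between frequent and rare classes.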
The result: eKLD decreases as class size increases; the network is least invariant on the classes with the fewest examples. This implies that the CNN does not learn that the same nuisance transformations are applied to all images and therefore does not transfer this knowledge to the classes with less training data. A CNN learns invariance separately for each class.
You might think this is a cool experiment, but how is it related to meta-learning?
Let’s look at one of the original works on meta-learning. In the 1998 book “Learning to Learn”, Sebastian Thrun and Lorien Pratt define an algorithm as capable of “learning to learn” if it improves its performance in proportion to the number of tasks it is exposed to:
an algorithm is said to learn to learn if its performance at each task improves with experience and with the number of tasks. Put differently, a learning algorithm whose performance does not depend on the number of learning tasks, which hence would not benefit from the presence of other learning tasks, is not said to learn to learn
Now how does this apply to the experiment just outlined? In the introduction, we thought about how a sophisticated learner might handle a dataset like the one described in the last section. We said that a sophisticated learner would learn that the nuisance transformations are applied uniformly to all classes. Therefore, if we added more classes to the dataset, the learner would become more invariant to the transformations because we expose it to more examples of them. Since handling these transformations is part of the classification task for each class, the learner should, all else being equal, become better at classification, especially on classes with few training examples. To see this, we must think of the multi-class classification problem not as a single task but as a set of binary classification tasks: multiple mappings from image features to class activations that must be learned. Thrun and Pratt continue:
For an algorithm to fit this definition, some kind of transfer must occur between multiple tasks that must have a positive impact on expected task-performance.
This transfer is what Zhou et al. [2022] found lacking in the plain CNN, and what their proposed method provides.
Zhou et al. [2022] achieve the transfer with Multimodal Unsupervised Image-to-Image Translation (MUNIT) networks.
MUNIT networks are capable of performing image-to-image translation, which means that they can translate an image from one domain, such as pictures of leopards, into another domain, such as pictures of house cats. The translated image should look like a real house cat while still resembling the original leopard image. For instance, if the leopard in the original image has its eyes closed, the translated image should contain a house cat with closed eyes. Eye state is a feature present in both domains, so a good translator should not alter it. On the other hand, a leopard’s fur is yellow and spotted, while a house cat’s fur can be white, black, grey, or brown. To make the translated images indistinguishable from real house cats, the translator must thus replace leopard fur with house cat fur.
MUNIT networks learn to perform translations by correctly distinguishing the domain-agnostic features (such as eye state) from the domain-specific features (such as the distribution of fur color). They embed an image into two latent spaces: a content space that encodes the domain-agnostic features and a style space that encodes the domain-specific features (see figure above).
To transform a leopard into a house cat, we can encode the leopard into a content and a style code, discard the leopard-specific style code, randomly select a cat-specific style code, and assemble a house cat image that looks similar by combining the leopard’s content code with the randomly chosen cat style code (see figure below).
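In code, the translation step might look roughly like this (the module names are hypothetical stand-ins for MUNIT’s trained encoders and decoder; in MUNIT, style codes are sampled from a standard Gaussian prior):

```python
import torch

def translate_leopard_to_cat(x_leopard, content_encoder_leopard, decoder_cat, style_dim=8):
    """Swap the domain-specific part: keep the leopard image's content code,
    sample a random house-cat style code, and decode the combination."""
    c = content_encoder_leopard(x_leopard)             # domain-agnostic content code
    s_cat = torch.randn(x_leopard.size(0), style_dim)  # random cat style code, s ~ N(0, I)
    return decoder_cat(c, s_cat)                       # house cat with the leopard's pose, eye state, ...
```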
Zhou et al. [2022], however, do not use MUNIT to translate between two distinct domains such as leopards and house cats. Instead, they train it on a single domain containing all classes of the dataset, so that it translates images within that domain.
However, if the domain included house cats and apples, fur color would not be a valid style feature. If it were, the translator might translate fur color on an apple and give it black fur, which would look suspiciously out of place. Whatever house cats and apples have in common - maybe their position or size in the frame - would be a valid style feature. We would expect an intra-domain translator on an apples-and-cats dataset to change the position and size of an apple but not to turn it into a cat (not even partially).
It turns out that on a dataset with uniformly applied nuisance transformations, the nuisance transformations are valid style features: The result of randomly rotating an apple cannot be discerned as artificial when images of all classes, house cats and apples, were previously randomly rotated.
Zhou et al. [2022] exploit exactly this: they train a single-domain MUNIT on the imbalanced dataset so that its style space captures the nuisance transformations, which can then be re-applied to images of the rare classes.
So MUNIT can decompose the example-specific information, e.g., whether something is an apple or a house cat, from the meta-information, the applied nuisance transformations. When we add more classes, it has more data and can better learn the transformation distribution $T(\cdot |x)$. Does solving a meta-learning problem make MUNIT a meta-learner? Let’s look at the relationship MUNIT has with traditional meta-learners.
To see how well MUNIT fits the definition of meta-learning, let’s define meta-learning more concretely. We define contemporary neural-network-based meta-learners in terms of a learning procedure: an outer training loop with a set of trainable parameters iterates over tasks drawn from a task distribution. Formally, a task comprises a dataset and a loss function, $ \mathcal{T} = \{ \mathcal{D}, \mathcal{L} \} $. In an inner loop, a learning algorithm based on the outer loop’s parameters is instantiated for each task. We train it on the task’s training set (meta-training) and test it on the task’s validation set (meta-validation). We then use the loss on this validation set to update the outer loop’s parameters. In this task-centered view of meta-learning, we can express the objective function as
$$ \underset{\omega}{\mathrm{min}} \; \mathbb{E}_{\mathcal{T} \sim p(\mathcal{T})} \; \mathcal{L}(\mathcal{D}, \omega), $$
where $ \omega $ denotes the parameters trained exclusively at the meta-level, i.e., the meta-knowledge learnable from the task distribution.
This meta-knowledge is what the meta-learner accumulates and transfers across the tasks in Thrun and Pratt’s definition above. Collecting meta-knowledge allows the meta-learner to improve its expected task performance with the number of tasks. The meta-knowledge in the experiment of Zhou et al. [2022] is the distribution of nuisance transformations, which is shared across all classes and thus across all (binary classification) tasks.
The task-centered view of meta-learning brings us to a related issue: a meta-learner must discern and decompose task-specific knowledge from meta-knowledge. Contemporary meta-learners decompose meta-knowledge through the different objectives of their inner and outer loops and their respective loss terms. They store meta-knowledge in the outer loop’s parameter set $ \omega $ but must not learn task-specific information there. Any meta-features left unlearned lead to slower adaptation and thus lower performance: meta-underfitting. Conversely, any task-specific features learned at the meta-level will not generalize to unseen tasks from the distribution, also hurting performance: meta-overfitting.
We recall that, similarly, MUNIT must separate domain-specific style information from domain-agnostic content information, and must not leak one into the other.
Both meta-learning and multi-domain unsupervised image-to-image translation are thus learning problems that require a separation of the general from the specific. This is even visible when comparing their formalizations as optimization problems.
Franceschi et al. [2018] formalize meta-learning as such a bi-level optimization problem. The outer objective is
$$ \bbox[5pt, border: 2px solid blue]{ \begin{align*} \omega^{*} = \underset{\omega}{\mathrm{argmin}} \sum_{i=1}^{M} \mathcal{L}^{meta}(\theta^{* \; (i)}(\omega), D^{val}_i), \end{align*} } $$
where $M$ denotes the number of tasks in a batch, $\mathcal{L}^{meta}$ is the meta-loss function, and $ D^{val}_i $ is the validation set of task $ i $. $\omega$ represents the parameters exclusively updated in the outer loop. $ \theta^{* \; (i)} $ represents an inner loop learning task $ i $, which we can formally express as a sub-objective constraining the primary objective:
$$ \bbox[5pt, border: 2px solid red]{ \begin{align*} s.t. \; \theta^{* \; (i)} = \underset{\theta}{\mathrm{argmin}} \; \mathcal{L^{task}}(\theta, \omega, D^{tr}_i), \end{align*} } $$
where $ \theta $ are the model parameters updated in the inner loop, $ \mathcal{L}^{task} $ is the loss function by which they are updated, and $ D^{tr}_i $ is the training set of task $ i $.
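To make this bi-level structure concrete, here is a minimal, self-contained PyTorch sketch of one possible instantiation: a MAML-style learned initialization on toy regression tasks. The task distribution, network, and hyperparameters are our own illustrative choices, not the method of Zhou et al. [2022]; $\omega$ is the shared initialization and $\theta^{(i)}$ the task-adapted parameters.

```python
import torch

def sample_task(n=20):
    """A toy sine-regression task, split into (D_tr, D_val)."""
    amp, phase = torch.rand(1) * 4 + 1, torch.rand(1) * 3.14
    x = torch.rand(n, 1) * 10 - 5
    y = amp * torch.sin(x + phase)
    return (x[: n // 2], y[: n // 2]), (x[n // 2 :], y[n // 2 :])

def forward(params, x):
    """A tiny MLP written as an explicit function of its parameter dict."""
    h = torch.tanh(x @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]

def loss_fn(params, data):
    x, y = data
    return ((forward(params, x) - y) ** 2).mean()

# omega: meta-parameters (here, a shared initialization), updated only in the outer loop.
omega = {
    "w1": torch.randn(1, 40) * 0.1, "b1": torch.zeros(40),
    "w2": torch.randn(40, 1) * 0.1, "b2": torch.zeros(1),
}
for p in omega.values():
    p.requires_grad_(True)
meta_opt = torch.optim.Adam(omega.values(), lr=1e-3)
inner_lr, inner_steps, M = 0.01, 5, 4  # M tasks per meta-batch

for meta_iteration in range(1000):
    meta_loss = 0.0
    for _ in range(M):
        d_tr, d_val = sample_task()
        # Inner loop: theta^(i) ~= argmin_theta L^task(theta, omega, D_tr_i),
        # approximated by a few gradient steps starting from omega.
        theta = dict(omega)
        for _ in range(inner_steps):
            grads = torch.autograd.grad(
                loss_fn(theta, d_tr), list(theta.values()), create_graph=True
            )
            theta = {k: v - inner_lr * g for (k, v), g in zip(theta.items(), grads)}
        # Outer objective: L^meta(theta^(i)(omega), D_val_i), summed over the task batch.
        meta_loss = meta_loss + loss_fn(theta, d_val)
    meta_opt.zero_grad()
    (meta_loss / M).backward()
    meta_opt.step()
```

The key point mirrored from the boxed equations is that $\omega$ is updated only from the validation losses of the adapted parameters $\theta^{(i)}$, while each $\theta^{(i)}$ is obtained from the constrained inner problem on the task’s training set.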
It turns out that the loss functions of MUNIT can be similarly decomposed:
MUNIT’s loss function consists of two adversarial (GAN) loss terms, one per domain, plus several reconstruction terms that we collect into a single term $\mathcal{L}_{recon}$ below.
MUNIT’s GAN loss term for the second domain is
$$ \begin{align*} &\mathcal{L}^{x_{2}}_{GAN}(\theta_d, \theta_c, \theta_s) \\\\ =& \;\mathbb{E}_{c_{1} \sim p(c_{1}), s_{2} \sim p(s_{2})} \left[ \log (1 -D_ {2} (G_{2} (c_{1}, s_{2}, \theta_c, \theta_s), \theta_d)) \right] \\ +& \;\mathbb{E}_{x_{2} \sim p(x_{2})} \left[ \log(D_{2} (x_{2}, \theta_d)) \right], \end{align*} $$
where $ \theta_d $ represents the parameters of the discriminator network, $p(x_2)$ is the data distribution of the second domain, and $ c_1 $ is the content embedding of an image from the first domain to be translated. $ s_2 $ is a random style code of the second domain. $ D_2 $ is the discriminator of the second domain, and $ G_2 $ is its generator. MUNIT’s full objective function is:
$$ \begin{align*} \underset{\theta_c, \theta_s}{\mathrm{argmin}} \; \underset{\theta_d}{\mathrm{argmax}}& \;\mathbb{E}_{c_{1} \sim p(c_{1}), s_{2} \sim p(s_{2})} \left[ \log (1 -D_{2} (G_{2} (c_{1}, s_{2}, \theta_c, \theta_s), \theta_d)) \right] \\ +& \; \mathbb{E}_{x_{2} \sim p(x_{2})} \left[ \log(D_{2} (x_{2}, \theta_d)) \right] + \; \mathcal{L}^{x_{1}}_{GAN}(\theta_d, \theta_c, \theta_s) \\ +& \;\mathcal{L}_{recon}(\theta_c, \theta_s) \end{align*} $$
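As a rough code sketch of the domain-2 GAN term (the module names are hypothetical stand-ins, and we assume $D_2$ outputs the probability that its input is a real domain-2 image):

```python
import torch

def gan_loss_domain2(D2, G2, content_encoder_1, x1, x2, style_dim=8):
    """Mini-batch estimate of the two expectations in the GAN term above.
    The discriminator tries to maximize this value (assigning low probability
    to translated images and high probability to real domain-2 images);
    the generator/encoders try to minimize it."""
    c1 = content_encoder_1(x1)                    # content code of a domain-1 image
    s2 = torch.randn(x1.size(0), style_dim)       # random domain-2 style code, s2 ~ p(s2)
    fake_x2 = G2(c1, s2)                          # translated image
    loss_fake = torch.log(1.0 - D2(fake_x2)).mean()
    loss_real = torch.log(D2(x2)).mean()
    return loss_fake + loss_real
```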
We can restate this objective in the same bi-level form as the meta-learning objective above. The primary objective, optimized over $\theta_c$ and $\theta_s$, becomes
$$ \bbox[5px, border: 2px solid blue]{ \begin{align*} \omega^{*} & = \{ \theta_c^*, \theta_s^* \} \\\\ & = \underset{\theta_c, \theta_s}{\mathrm{argmin}} \; \mathbb{E}_{c_{1} \sim p(c_{1}), s_{2} \sim p(s_{2})} \left[ \log (1 -D_ {2} (G_{2} (c_{1}, s_{2}, \theta_c, \theta_s), \theta_d^{*})) \right] \\ & + \mathcal{L}_{recon}(\theta_c, \theta_s), \end{align*} } $$
We then add a single constraint, a subsidiary maximization problem for the discriminator function:
$$ \bbox[5px, border: 2px solid red]{ \begin{align*} &s.t. \;\theta_d^{*} \\\\ & = \underset{\theta_d}{\mathrm{argmax}} \; \mathbb{E}_{c_{1} \sim p(c_{1}), s_{2} \sim p(s_{2})} \left[ \log (1 -D_ {2} (G_{2} (c_{1}, s_{2}, \theta_c, \theta_s), \theta_d)) \right] \\ & + \mathbb{E}_{x_{2} \sim p(x_{2})} \left[ \log(D_{2} (x_{2}, \theta_d)) \right] \end{align*} } $$
Interestingly, this bi-level view does not only resemble the meta-learning procedure expressed above; the bi-level optimization also facilitates a similar effect. Maximizing the discriminator’s performance in the constraint punishes style information that is encoded as content information: if style information leaks into the content code, the discriminator detects artifacts of the original domain in the translated image. Similarly, a meta-learner prevents meta-overfitting via its outer optimization loop.
The two procedures also share “indirect” parameter updates. During GAN training, the discriminator’s parameters are updated in response to changes in the generator’s parameters, which in turn derive from the discriminator’s previous parameters, and so forth; the training of the discriminator and the generator are separate but mutually dependent processes. Similarly, in a meta-learner, the outer loop impacts the inner loop by determining its initial learner, while the results of the inner loop impact the outer loop via the meta-validation loss. While they often overlap in practice, the inner and outer loop parameter sets could, in principle, be completely disjoint, as they are in GAN training.
To conclude, we distinguish between MUNIT applied to two domains and MUNIT applied to a single domain. When applied to two domains, MUNIT is a binary image-to-image translation architecture, i.e., we cannot simply add further domains. For multi-domain translation, we would need to train many MUNIT networks to translate between pairs of domains. Thus, although it uses mechanisms similar to a meta-learner’s, it is not a meta-learner in an image-to-image translation context.
However, when applied to a single domain, MUNIT does “learn to learn” (if you agree with the conclusion of the previous section), as it combines information from all classes to extract the transformation distribution. While it does not perform classification, the class information of an image is encoded in MUNIT’s content space. Since MUNIT is trained without supervision, this encoding is probably closer to a distance metric than to an actual class label. We might thus classify single-domain MUNIT as an unsupervised, generative meta-learner: it performs meta-learning in the general sense of “discerning the general from the task-specific”, using a related two-loop training approach with two sets of parameters and a similar loss function. MUNIT is thus related to meta-learning methods.
The ICLR paper “Do Deep Networks Transfer Invariances Across Classes?” thus shows not only that a plain CNN fails to transfer invariances to rare classes, but also how a mechanism closely related to meta-learning can provide exactly this transfer.
Recognizing the connection between the meta-information learned in image-to-image translation and the meta-knowledge extracted by meta-learners might enable researchers to design better architectures in both domains.