HyperTransformer: B Model Parameters

Written by escholar | Published 2024/04/16
Tech Story Tags: hypertransformer | supervised-model-generation | few-shot-learning | convolutional-neural-network | small-target-cnn-architectures | task-independent-embedding | conventional-machine-learning | parametric-model

TL;DR: In this paper we propose a new few-shot learning approach that allows us to decouple the complexity of the task space from the complexity of individual tasks.

This paper is available on arXiv under the CC 4.0 license.

Authors:

(1) Andrey Zhmoginov, Google Research ({azhmogin,sandler,mxv}@google.com);

(2) Mark Sandler, Google Research ({azhmogin,sandler,mxv}@google.com);

(3) Max Vladymyrov, Google Research ({azhmogin,sandler,mxv}@google.com).


B MODEL PARAMETERS

Here we provide additional information about the model parameters used in our experiments.

Image augmentations and feature extractor parameters. For the OMNIGLOT dataset, we used the same image augmentations originally proposed in MAML. For the MINIIMAGENET and TIEREDIMAGENET datasets, however, we used ImageNet-style image augmentations, including horizontal image flipping, random color augmentations, and random image cropping. This helped us avoid model overfitting on MINIIMAGENET and possibly on TIEREDIMAGENET.
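For concreteness, the sketch below shows what such an ImageNet-style augmentation pipeline could look like. It uses torchvision as an assumption (the paper does not name a library), and the crop size and color-jitter strengths are illustrative placeholders rather than the authors' settings.

```python
# A minimal sketch of ImageNet-style augmentations, assuming torchvision.
# Crop size, jitter strengths, and flip probability are illustrative only.
import torchvision.transforms as T

miniimagenet_augment = T.Compose([
    T.RandomResizedCrop(84),           # random image cropping (84x84 is a common MiniImageNet size)
    T.RandomHorizontalFlip(p=0.5),     # horizontal image flipping
    T.ColorJitter(brightness=0.4,      # random color augmentations
                  contrast=0.4,
                  saturation=0.4),
    T.ToTensor(),
])
```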

The dimensionality d of the label encoding ξ and the weight slice encoding µ was typically set to 32. Increasing d up to the maximum number of weight slices plus the number of per-episode labels would allow the model to fully disentangle examples with different labels and different weight slices, but it can also slow down training.
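As a rough illustration, the label and weight-slice encodings can be modeled as learned embedding tables of dimensionality d. In this sketch, only d = 32 comes from the text; the counts and variable names are hypothetical placeholders.

```python
# A hedged sketch of d-dimensional label and weight-slice encodings.
import torch
import torch.nn as nn

d = 32                      # encoding dimensionality used in most experiments
num_labels = 5              # per-episode labels (e.g., 5-way classification; illustrative)
num_weight_slices = 64      # weight slices of one generated layer (illustrative)

label_encoding = nn.Embedding(num_labels, d)         # xi: one learned vector per label
slice_encoding = nn.Embedding(num_weight_slices, d)  # mu: one learned vector per weight slice

# Setting d >= num_weight_slices + num_labels would permit fully disentangled
# (one-hot-like) encodings, at the cost of slower training.
```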

Transformer parameters. Since the weight tensors of each layer are generally different, our per-layer transformers were also different. The key, query, and value dimensions of the transformer were set to a predefined fraction ν of the input embedding size, which in turn was a function of the label embedding and feature dimensionality as well as the size of the weight slices. The inner dimension of the final fully-connected layer in the transformer was chosen in the same way. In our MINIIMAGENET and TIEREDIMAGENET experiments, ν was set to 0.5, while in the OMNIGLOT experiments we used ν = 1. Each transformer typically contained 2 or 3 encoder layers and used 2 heads for OMNIGLOT and 8 heads for MINIIMAGENET and TIEREDIMAGENET, respectively.
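The sizing rule can be summarized in a few lines. Below is a hedged sketch of how the per-layer transformer configuration could be derived from these hyperparameters; the helper name and the example embedding size are assumptions, and the mapping of the 2- and 3-layer variants to specific datasets is shown only for illustration.

```python
# Sketch: derive per-layer transformer sizes from nu and the input embedding size.
# Only the rule "key/query/value (and FFN inner) dims = nu * input embedding size",
# nu per dataset, and the head counts follow the text; everything else is illustrative.
def transformer_config(input_embedding_size: int, dataset: str) -> dict:
    if dataset == "omniglot":
        nu, num_layers, num_heads = 1.0, 2, 2
    else:  # miniimagenet / tieredimagenet
        nu, num_layers, num_heads = 0.5, 3, 8
    inner = int(nu * input_embedding_size)
    return {
        "qkv_dim": inner,        # key, query, and value dimension
        "ffn_inner_dim": inner,  # inner dimension of the final fully-connected layer
        "num_layers": num_layers,
        "num_heads": num_heads,
    }

# Example: a layer whose input embeddings are 512-dimensional.
print(transformer_config(512, "miniimagenet"))
# {'qkv_dim': 256, 'ffn_inner_dim': 256, 'num_layers': 3, 'num_heads': 8}
```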

Learning schedule. In all our experiments, we used a plain gradient descent optimizer with a learning rate in the 0.01 to 0.02 range; our early experiments with more advanced optimizers were unstable. We used a learning rate decay schedule in which the learning rate was reduced by a factor of 0.95 every 10^5 learning steps.
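In PyTorch terms (an assumption; the framework is not specified here), this schedule corresponds to plain SGD with a step decay, e.g.:

```python
# Sketch of the described schedule: SGD with lr in 0.01-0.02,
# multiplied by 0.95 every 10^5 training steps.
import torch

model = torch.nn.Linear(10, 10)  # placeholder for the full HyperTransformer model
optimizer = torch.optim.SGD(model.parameters(), lr=0.02)

# Calling scheduler.step() once per training step makes step_size count steps,
# so the learning rate decays by 0.95 every 100,000 steps.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100_000, gamma=0.95)

# Inside the training loop (hypothetical helper names):
#   loss = episode_loss(model, batch)
#   loss.backward()
#   optimizer.step()
#   scheduler.step()
#   optimizer.zero_grad()
```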

