HyperTransformer: D Dependence On Parameters and Ablation Studies

Written by escholar | Published 2024/04/16
Tech Story Tags: hypertransformer | supervised-model-generation | few-shot-learning | convolutional-neural-network | small-target-cnn-architectures | task-independent-embedding | conventional-machine-learning | parametric-model

TL;DR: In this paper we propose a new few-shot learning approach that allows us to decouple the complexity of the task space from the complexity of individual tasks.

This paper is available on arxiv under CC 4.0 license.

Authors:

(1) Andrey Zhmoginov, Google Research & {azhmogin,sandler,mxv}@google.com;

(2) Mark Sandler, Google Research & {azhmogin,sandler,mxv}@google.com;

(3) Max Vladymyrov, Google Research & {azhmogin,sandler,mxv}@google.com.


D DEPENDENCE ON PARAMETERS AND ABLATION STUDIES

Most of our parameter explorations were conducted on the OMNIGLOT dataset. We chose a 16-channel model trained on a 1-shot-20-way OMNIGLOT task as an example of a model for which generating just the logits layer was sufficient, and a 4-channel model trained on a 5-shot-20-way OMNIGLOT task as a model for which generating all convolutional layers proved beneficial. Figures 5 and 6 compare training and test accuracies on OMNIGLOT for different parameter values for these two models. Here we used only two independent runs per parameter value, which did not allow us to sufficiently reduce the statistical error. Despite this, in the following we try to highlight a few notable parameter dependencies. Note that in some experiments with particularly large feature or model sizes, training progressed beyond the target number of steps, and there could also be overfitting for very large models.
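For reference, the two baseline configurations can be summarized roughly as in the sketch below. The `AblationConfig` container, its field names, and the placeholder default values are our own illustrative choices, not the authors' code or their exact settings.

```python
from dataclasses import dataclass

@dataclass
class AblationConfig:
    """Hypothetical container for the hyperparameters varied in this appendix."""
    cnn_channels: int          # channels in the generated target CNN
    num_shots: int             # support examples per class
    num_ways: int              # classes per episode
    generate_all_layers: bool  # False = generate only the logits layer
    # Placeholder defaults for illustration; not the paper's exact values.
    num_transformer_layers: int = 2
    num_heads: int = 8
    embedding_dim: int = 64
    local_feature_dim: int = 16
    shared_feature_dim: int = 32

# 16-channel model: generating just the logits layer was sufficient.
logits_only_baseline = AblationConfig(
    cnn_channels=16, num_shots=1, num_ways=20, generate_all_layers=False)

# 4-channel model: generating all convolutional layers proved beneficial.
all_layers_baseline = AblationConfig(
    cnn_channels=4, num_shots=5, num_ways=20, generate_all_layers=True)
```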

Number of transformer layers. Increasing the number of transformer layers is particularly important for the 4-channel model. The 16-channel model also benefits from using 2 transformer layers rather than 1, but its performance appears to degrade with 3 transformer layers.
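The number-of-layers knob can be pictured as the depth of a standard transformer encoder stack acting on the episode's token sequence. The sketch below uses PyTorch's built-in encoder purely as an illustration; it is not the authors' implementation, and the token counts in the example are made up.

```python
import torch
import torch.nn as nn

def build_weight_generator(embedding_dim: int, num_heads: int,
                           num_transformer_layers: int) -> nn.TransformerEncoder:
    """Stack of encoder layers; num_transformer_layers is the ablated knob (1, 2 or 3)."""
    layer = nn.TransformerEncoderLayer(
        d_model=embedding_dim, nhead=num_heads,
        dim_feedforward=2 * embedding_dim, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_transformer_layers)

# Example: tokens for 20 support samples plus 16 weight-placeholder slots.
tokens = torch.randn(1, 36, 64)              # (batch, sequence, embedding_dim)
generator = build_weight_generator(64, 8, num_transformer_layers=2)
out = generator(tokens)                      # same shape; weight slots are read out
```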

Local feature dimension. A particularly low local feature dimension hurts the performance of both models, while a higher local feature dimension appears to be advantageous in all cases except for the 32-dimensional local feature in the 4-channel model.

Embedding dimension. A particularly low embedding dimension of 16 hurts the performance of both models.

Number of transformer heads. Increasing the number of transformer heads leads to performance degradation in the 16-channel model, but does not have a pronounced effect in the 4-channel model.

Shared feature dimensions. Removing the shared feature, or using only an 8-dimensional shared feature, hurts the performance of both the 4- and 16-channel models.
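One way to picture how the local feature, shared feature, and embedding dimensions above interact is as the pieces concatenated into each per-sample token before it enters the transformer. The composition below is our own assumption made for illustration, not a reproduction of the paper's token construction.

```python
import torch
import torch.nn as nn

class SampleTokenizer(nn.Module):
    """Illustrative per-sample token: local features, an episode-shared feature,
    and a one-hot label encoding, projected to the transformer embedding size.
    The exact split into these pieces is an assumption of this sketch."""

    def __init__(self, local_feature_dim: int, shared_feature_dim: int,
                 num_ways: int, embedding_dim: int):
        super().__init__()
        in_dim = local_feature_dim + shared_feature_dim + num_ways
        self.proj = nn.Linear(in_dim, embedding_dim)

    def forward(self, local_feats, shared_feat, labels_onehot):
        # local_feats:   (num_samples, local_feature_dim)
        # shared_feat:   (shared_feature_dim,) broadcast to every sample
        # labels_onehot: (num_samples, num_ways)
        n = local_feats.shape[0]
        shared = shared_feat.expand(n, -1)
        return self.proj(torch.cat([local_feats, shared, labels_onehot], dim=-1))
```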

Transformer architecture. While the majority of our experiments were conducted with a sequence of transformer encoder layers, we also experimented with an alternative weight generation approach in which both encoder and decoder transformer layers were employed (see Fig. 4). Our experiments with both architectures suggest that the role of the decoder is pronounced but very different in the two models: in the 16-channel model, the presence of the decoder improves performance, while in the 4-channel model it leads to accuracy degradation.
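The two architectures compared here can be sketched as an encoder-only generator that processes support-sample tokens and weight-placeholder tokens in a single sequence, versus an encoder-decoder variant in which weight queries cross-attend to the encoded support set (cf. Fig. 4). The sketch relies on standard PyTorch modules and only approximates the paper's design; the token counts are placeholders.

```python
import torch
import torch.nn as nn

d_model, nhead = 64, 8

# Encoder-only: support tokens and weight-placeholder tokens share one sequence.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2)

# Encoder-decoder: weight queries attend to the encoded support tokens.
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers=2)

support = torch.randn(1, 20, d_model)   # tokens for the support samples
weight_q = torch.randn(1, 16, d_model)  # placeholder tokens, one per weight slice

# Variant 1: concatenate, encode, then read back the weight slots.
joint = encoder(torch.cat([support, weight_q], dim=1))
weights_enc_only = joint[:, -16:, :]

# Variant 2: encode the support set, then decode the weight queries against it.
memory = encoder(support)
weights_enc_dec = decoder(weight_q, memory)
```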

Inner transformer embedding sizes. By varying the ν parameter for different components of the transformer model (the key/query pair, the value, and the inner fully-connected layer size), we quantify their importance for model performance. Using a very low ν for the value dimension hurts the performance of both models. The effect of the key/query and inner dimensions is distinctly visible only in the 4-channel model, where ν = 1 or ν = 1.5 appears to produce the best results.
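To make the ν knob concrete, the sketch below scales the key/query width, the value width, and the inner feed-forward width of a single transformer block relative to the embedding size. This is a generic attention block written for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ScaledTransformerBlock(nn.Module):
    """One encoder block whose internal widths are nu * dim.

    nu_qk, nu_v and nu_ffn correspond to the key/query, value and inner
    fully-connected sizes ablated above (e.g. nu = 0.5, 1.0 or 1.5).
    Assumes the scaled widths stay divisible by num_heads.
    """

    def __init__(self, dim: int, num_heads: int,
                 nu_qk: float = 1.0, nu_v: float = 1.0, nu_ffn: float = 1.0):
        super().__init__()
        qk_dim, v_dim = int(nu_qk * dim), int(nu_v * dim)
        self.num_heads = num_heads
        self.scale = (qk_dim // num_heads) ** -0.5
        self.to_q = nn.Linear(dim, qk_dim)
        self.to_k = nn.Linear(dim, qk_dim)
        self.to_v = nn.Linear(dim, v_dim)
        self.attn_out = nn.Linear(v_dim, dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, int(nu_ffn * dim)), nn.ReLU(),
            nn.Linear(int(nu_ffn * dim), dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                        # x: (batch, tokens, dim)
        b, t, _ = x.shape
        split = lambda z: z.view(b, t, self.num_heads, -1).transpose(1, 2)
        q, k, v = split(self.to_q(x)), split(self.to_k(x)), split(self.to_v(x))
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        x = self.norm1(x + self.attn_out(out))
        return self.norm2(x + self.ffn(x))
```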

Weight allocation approach. Our experiments with the “spatial” weight allocation in the 4- and 16-channel models showed slightly inferior performance (training and test accuracies dropping by about 0.2% to 0.4% in both experiments) compared to that obtained with the “output” weight allocation method.
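A minimal sketch of the two allocation schemes, under our reading of the paper: with “output” allocation each generated token supplies the full kernel slice for one output channel, while with “spatial” allocation each token supplies the weights for one spatial position of the kernel. The function and tensor names are illustrative.

```python
import torch

def allocate_output(tokens: torch.Tensor, c_out: int, c_in: int, k: int) -> torch.Tensor:
    """'Output' allocation: one token per output channel.

    tokens: (c_out, c_in * k * k)  ->  conv kernel (c_out, c_in, k, k)
    """
    assert tokens.shape == (c_out, c_in * k * k)
    return tokens.view(c_out, c_in, k, k)

def allocate_spatial(tokens: torch.Tensor, c_out: int, c_in: int, k: int) -> torch.Tensor:
    """'Spatial' allocation: one token per spatial location of the kernel.

    tokens: (k * k, c_out * c_in)  ->  conv kernel (c_out, c_in, k, k)
    """
    assert tokens.shape == (k * k, c_out * c_in)
    return tokens.view(k, k, c_out, c_in).permute(2, 3, 0, 1).contiguous()

# Example: a 3x3 convolution with 4 input and 4 output channels.
w_out = allocate_output(torch.randn(4, 4 * 9), c_out=4, c_in=4, k=3)
w_sp = allocate_spatial(torch.randn(9, 4 * 4), c_out=4, c_in=4, k=3)
assert w_out.shape == w_sp.shape == (4, 4, 3, 3)
```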

