TokenFormer

Rethinking Transformer Scaling with Tokenized Model Parameters

TokenFormer is a fully attention-based neural network that unifies the computations of token-token and token-parameter interactions by relying entirely on the attention mechanism, maximizing the flexibility of the network.

It tokenizes not only the data but also the model parameters, replacing the traditional notion of a fixed model with interaction flows between data and parameter tokens. As a result, it can handle a variable number of parameters and offers greater flexibility than traditional Transformers, which we expect to further contribute to the development of foundation models in the following ways:

Incremental Model Scaling: Our model allows for progressive scaling without retraining from scratch by adding new key-value parameter pairs, greatly reducing training costs. This advantage is the primary focus of this paper.
Device-Cloud Collaboration: It can serve as the cloud knowledge base in Device-Cloud Collaboration, with each pair of key-value parameter tokens representing a learnable pattern.
Sparse Inference (MoE): We interpret Tokenformer as an extreme Mixture of Experts (MoE), with each key-value pair acting as an expert, which greatly reduces inference costs.
Parameter-Efficient Tuning: When confronted with new tasks or datasets, the model can augment its pre-trained parameters with a small number of new parameter tokens, thereby adapting quickly to the specific task requirements.
Integrating Vision and Language Models: Enabling seamless vision-language integration by merging key-value parameter tokens from pre-trained vision and language models.
Model Interpretability: Its fully attention-based design naturally enhances model interpretability.

We introduce Tokenformer, a fully attention-based architecture that unifies the computations of token-token and token-parameter interactions by relying entirely on the attention mechanism, maximizing the flexibility of the network. This design allows the model to handle a variable number of parameters, inherently enhancing its scalability and facilitating progressively efficient scaling.

We hope that this architecture, by offering greater flexibility than traditional Transformers, will further contribute to the development of foundation models, sparse inference (MoE), parameter-efficient tuning, device-cloud collaboration, vision-language modeling, model interpretability, and beyond.


Token-Parameter Attention

Although Transformers excel across various domains, their scalability is limited by the high computational overhead resulting from prescribed token-parameter interactions (i.e., linear projections). As a result, scaling strategies that adjust architectural components (e.g., channel dimensions) typically require retraining the entire model from the beginning, leading to inefficient use of computational resources.

To overcome this challenge, we propose TokenFormer, an architecture built entirely on attention mechanisms. The central innovation of TokenFormer is the token-parameter attention (Pattention) layer, which incorporates a set of trainable tokens that function as model parameters and employs cross-attention to manage interactions between the input tokens and these parameter tokens.

Figure 1: TokenFormer is a fully attention-driven architecture featuring a new token-parameter attention (Pattention) layer. The Pattention uses a set of learnable tokens to represent model parameters and lets the input tokens attend to them.

Pattention. To implement our Pattention mechanism, we use the input tokens as queries and introduce two sets of n learnable parameter tokens, K_P and V_P, to represent the keys and values. Given input tokens X, the output of the scaled dot-product Pattention layer is computed as:

$$\mathrm{Pattention}(X, K_P, V_P) = \Theta\!\left(X \cdot K_P^{\top}\right) \cdot V_P$$

where Θ is a modified softmax operation for stable optimization of the Pattention layer. The output Pattention scores S are formulated as:

$$S_{ij} = f\!\left(A_{ij} \times \frac{\tau}{\sqrt{\sum_{k=1}^{n} |A_{ik}|^{2}}}\right), \quad A = X \cdot K_P^{\top}$$

where A is the attention score, τ is the scale factor, and f is a non-linearity function, which in our formulation is set to the GeLU function. This design improves gradient stability in our architecture and results in better performance compared to the standard softmax operation.
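
To make the mechanism concrete, below is a minimal PyTorch sketch of a Pattention layer following the description above. The class name Pattention, the argument num_param_tokens, the initialization scheme, and the choice of τ = √n together with L2 normalization inside the modified softmax are illustrative assumptions; the official TokenFormer implementation may differ in these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Pattention(nn.Module):
    """Token-parameter attention: input tokens attend to learnable key/value parameter tokens."""

    def __init__(self, dim_in: int, dim_out: int, num_param_tokens: int):
        super().__init__()
        # Two sets of n learnable parameter tokens serving as keys and values.
        self.key_params = nn.Parameter(torch.randn(num_param_tokens, dim_in) * 0.02)
        self.value_params = nn.Parameter(torch.randn(num_param_tokens, dim_out) * 0.02)
        self.tau = float(num_param_tokens) ** 0.5  # scale factor (assumed to be sqrt(n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim_in); the input tokens act as queries.
        scores = x @ self.key_params.t()                    # (batch, seq_len, n)
        # Modified softmax: L2-normalize each score row, rescale by tau, apply GeLU.
        norm = scores.norm(dim=-1, keepdim=True).clamp_min(1e-6)
        scores = F.gelu(scores * self.tau / norm)
        return scores @ self.value_params                   # (batch, seq_len, dim_out)
```

Used this way, the layer acts as a drop-in replacement for a linear projection from dim_in to dim_out, while the number of parameter tokens n can be chosen, and later grown, independently of the feature dimensions.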

Overall Architecture. Figure 1 illustrates the architecture of Tokenformer. We follow the standard Transformer and replace all of its linear projections with our Pattention layers. Designing the architecture in this manner represents all fundamental components, including both input data and model parameters, as tokens within the computational framework, thereby establishing a fully attention-based neural network characterized by exceptional flexibility.
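
As a rough sketch of this design (not the reference implementation), the block below reuses the Pattention module sketched earlier and replaces the query, key, value, and output projections as well as the feed-forward sublayer of a standard pre-norm Transformer block with Pattention layers. The layer sizes, the single-Pattention feed-forward, and the use of PyTorch's built-in scaled_dot_product_attention for the token-token attention are assumptions made for illustration.

```python
class TokenFormerBlock(nn.Module):
    """A Transformer-style block in which every linear projection is a Pattention layer."""

    def __init__(self, dim: int, num_heads: int, num_param_tokens: int):
        super().__init__()
        self.num_heads = num_heads
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Token-parameter attention replaces the QKV and output projections ...
        self.q_proj = Pattention(dim, dim, num_param_tokens)
        self.k_proj = Pattention(dim, dim, num_param_tokens)
        self.v_proj = Pattention(dim, dim, num_param_tokens)
        self.out_proj = Pattention(dim, dim, num_param_tokens)
        # ... and the feed-forward network as well.
        self.ffn = Pattention(dim, dim, num_param_tokens)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        h = self.norm1(x)
        q, k, v = self.q_proj(h), self.k_proj(h), self.v_proj(h)
        # Token-token attention between the input tokens themselves (multi-head).
        q = q.view(b, t, self.num_heads, -1).transpose(1, 2)
        k = k.view(b, t, self.num_heads, -1).transpose(1, 2)
        v = v.view(b, t, self.num_heads, -1).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, t, d)
        x = x + self.out_proj(out)
        x = x + self.ffn(self.norm2(x))
        return x
```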

Progressive Model Scaling

Consider an existing Tokenformer model equipped with a set of pre-trained key-value parameter tokens. To scale the model, we augment this set by appending new key-value parameter tokens.

Figure 2: As the model scales, TokenFormer adds new learnable tokens to expand the existing key-value parameter sets, while keeping the feature dimension constant and leaving the rest of the computation unaffected.

This scaling scheme permits the integration of an arbitrary number of parameters without altering the input or output dimension.
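
A hedged sketch of this scaling step, again written against the Pattention module sketched earlier: new key-value parameter tokens are concatenated onto the pre-trained ones, leaving the feature dimensions and the surrounding computation untouched. Zero-initializing the new keys so that the appended pairs start out inactive, and recomputing the scale factor as √n, are assumptions for illustration rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn


def scale_pattention(layer: "Pattention", num_new_tokens: int) -> None:
    """Append new key-value parameter token pairs to an existing Pattention layer."""
    device, dtype = layer.key_params.device, layer.key_params.dtype
    dim_in, dim_out = layer.key_params.shape[1], layer.value_params.shape[1]
    # New keys start at zero, so their scores are zero and GeLU(0) = 0: the new
    # pairs contribute nothing to the output until training updates them.
    new_keys = torch.zeros(num_new_tokens, dim_in, device=device, dtype=dtype)
    new_values = torch.randn(num_new_tokens, dim_out, device=device, dtype=dtype) * 0.02
    layer.key_params = nn.Parameter(torch.cat([layer.key_params.data, new_keys], dim=0))
    layer.value_params = nn.Parameter(torch.cat([layer.value_params.data, new_values], dim=0))
    # How the scale factor should be handled as n grows is left open here; this
    # sketch simply recomputes sqrt(n) for the enlarged set.
    layer.tau = float(layer.key_params.shape[0]) ** 0.5


# Example: grow every Pattention layer of a block before resuming training.
# for module in block.modules():
#     if isinstance(module, Pattention):
#         scale_pattention(module, num_new_tokens=1024)
```

After all Pattention layers have been enlarged in this way, training simply resumes on the expanded parameter set; because the input and output dimensions are unchanged, the rest of the network is unaffected.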


Experiments

Progressive Model Scaling. Our progressive scaling methodology employing Tokenformer achieves performance comparable to that of a Transformer model trained from scratch, while substantially reducing the training budget.

Figure 3: Left: Evaluating model scaling costs through cumulative computational budgets. The Transformer baseline incurs expenses for each individual scaling step performed independently from scratch, whereas Tokenformer aggregates costs across all scaling stages, including training a 124M model initially and progressively scaling to 354M, 757M, and 1.4B parameters. Right: Evaluating model scaling costs by measuring the budget required at each scaling stage. The Transformer baselines are consistent with those in the left plot, trained with 30B and 300B tokens. Similarly, for Tokenformer, the cost is the budget required for each incremental scaling step from the next smaller model. All experiments were conducted on TPU v4 hardware.

Language Modeling. The table below presents the performance of Tokenformer on a range of widely recognized zero-shot downstream tasks. Comparisons are drawn against leading open-source Transformer models of equivalent scale. As shown in the table, our model achieves competitive performance compared to the standard Transformer, demonstrating the expressive power of our architecture as a foundation model.

Table 1: Zero-shot Evaluation: The best performance for each model size is highlighted in bold. Our comparisons are made with publicly available Transformer-based LMs that use various tokenizers. Following Pythia, our model is trained for up to 300B tokens on the Pile dataset.

Vision Modeling. The table below validates the expressiveness of our model on visual tasks. We compare our approach against the standard Vision Transformer (ViT) trained with supervised learning on the ImageNet-1K dataset. As shown in the table, our model matches the performance of ViT, confirming its expressiveness in visual modeling.

Table 2: ImageNet-1K Classification: Comparison with the standard Vision Transformer on ImageNet-1K. The training hyperparameters (batch size, learning rate, etc.) are fully consistent with MAE. † denotes models whose parameter size has been matched to that of the standard ViT.

Conclusion

This paper introduces Tokenformer, a naturally scalable architecture that leverages the attention mechanism to facilitate not only inter-token computations but also interactions between tokens and model parameters, thereby enhancing architectural flexibility. By representing model parameters as tokens, we replace all linear projection layers in the Transformer with our Pattention layers, allowing for seamless and efficient incremental scaling without the need for retraining from scratch. We believe that this architecture, offering greater flexibility than traditional Transformers, will further contribute to the development of foundation models.

BibTeX

@article{wang2024tokenformer,
  title={{TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters}},
  author={Wang, Haiyang and Fan, Yue and Naeem, Muhammad Ferjad and Xian, Yongqin and Lenssen, Jan Eric and Wang, Liwei and Tombari, Federico and Schiele, Bernt},
  journal={arXiv preprint arXiv:2410.23168},
  year={2024}
}