TokenFormer is a fully attention-based neural network that unifies the computations of token-token and token-parameter
interactions by employing the attention mechanism throughout,
maximizing the flexibility of the network.
It tokenizes not only the input data but also the model parameters,
replacing the conventional notion of a fixed model with interaction flows between data and parameter tokens.
As a result, it can handle a variable number of parameters and offers greater flexibility than traditional Transformers,
which we believe will further contribute to the development of foundation models as follows:
We introduce Tokenformer, a fully attention-based architecture that unifies the computations of token-token and token-parameter interactions by employing the attention mechanism throughout, maximizing the flexibility of the network. Because it can handle a variable number of parameters, the architecture is inherently scalable and supports progressively efficient scaling.
We hope that this architecture, offering greater flexibility than traditional Transformers, will further contribute to the development of foundation models, sparse inference (MoE), parameter-efficient tuning, device-cloud collaboration, vision-language modeling, model interpretability, and more.
Although Transformers excel across various domains, their scalability is limited by the high computational overheads resulting from prescribed token-parameter
interactions (i.e., linear projections). As a result, scaling strategies that adjust architectural components (e.g., channel dimensions) typically require
retraining the entire model from scratch, leading to inefficient use of computational resources.
To overcome this challenge, we propose TokenFormer, an architecture entirely based on attention mechanisms. The central innovation of TokenFormer is the token-parameter attention (Pattention) layer, which incorporates a set of trainable tokens functioning as model parameters and then employs cross-attention to manage interactions between input tokens and these parameter tokens.
Pattention. To implement our Pattention mechanism, we use the input tokens as queries and introduce two sets of $n$ learnable parameter tokens to represent the keys and values. The output of the scaled dot-product Pattention layer is computed as

$$\mathrm{Pattention}(X, K_P, V_P) = \Theta(X \cdot K_P^{\top}) \cdot V_P,$$

where Θ is a modified softmax operation for stable optimization of the Pattention layer. The output Pattention scores are formulated as

$$\Theta(A)_{ij} = f\!\left(\frac{A_{ij} \times \tau}{\sqrt{\sum_{k=1}^{n} |A_{ik}|^{2}}}\right),$$

where A is the attention score matrix, τ is the scale factor, and f is a non-linearity function, which in our formulation is set to the GeLU function.
This design improves gradient stability in our architecture and results in better performance compared to the standard softmax operation.
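To make the Pattention computation concrete, here is a minimal PyTorch sketch of a Pattention layer following the description above. The module name, shapes, initialization, and the choice of τ = √n are illustrative assumptions rather than the official implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class Pattention(nn.Module):
    """Token-parameter attention: input tokens act as queries; keys and values
    are learnable parameter tokens (illustrative sketch)."""

    def __init__(self, dim_in: int, dim_out: int, num_param_tokens: int):
        super().__init__()
        # Two sets of n learnable parameter tokens serving as keys and values.
        self.key_params = nn.Parameter(torch.randn(num_param_tokens, dim_in) * 0.02)
        self.value_params = nn.Parameter(torch.zeros(num_param_tokens, dim_out))
        self.tau = math.sqrt(num_param_tokens)  # assumed scale factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim_in) -> attn: (batch, seq_len, n)
        attn = x @ self.key_params.t()
        # Modified softmax: scale each score row by tau / its L2 norm, then apply
        # GeLU instead of the usual exponentiate-and-normalize.
        norm = attn.norm(dim=-1, keepdim=True).clamp_min(1e-6)
        scores = F.gelu(attn * self.tau / norm)
        return scores @ self.value_params  # (batch, seq_len, dim_out)
```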
Overall Architecture. The above figure illustrates the architecture of Tokenformer.
We follow the standard Transformer design and replace all of its linear projections with our Pattention layers.
By designing the architecture in this manner, we represent all fundamental components, including both
the input data and the model parameters, as tokens within the computational framework, thereby establishing a
fully attention-based neural network characterized by exceptional flexibility.
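As a rough illustration of this design, the sketch below (reusing the Pattention module above) builds a pre-norm Transformer-style block in which the query, key, value, and output projections, as well as the feed-forward path, are all Pattention layers. The head count, hidden sizes, and the single-layer feed-forward path are simplifying assumptions; the token-token interaction itself remains standard softmax self-attention.

```python
class TokenformerBlock(nn.Module):
    """Pre-norm Transformer-style block with every linear projection replaced
    by a Pattention layer (simplified sketch)."""

    def __init__(self, dim: int, num_heads: int, num_param_tokens: int):
        super().__init__()
        self.num_heads = num_heads
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Token-parameter attention replaces the usual nn.Linear projections.
        self.q_proj = Pattention(dim, dim, num_param_tokens)
        self.k_proj = Pattention(dim, dim, num_param_tokens)
        self.v_proj = Pattention(dim, dim, num_param_tokens)
        self.out_proj = Pattention(dim, dim, num_param_tokens)
        self.ffn = Pattention(dim, dim, num_param_tokens)  # simplified feed-forward path

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, d = x.shape
        h = self.norm1(x)
        # Token-token interaction: standard multi-head softmax self-attention.
        q = self.q_proj(h).view(b, s, self.num_heads, -1).transpose(1, 2)
        k = self.k_proj(h).view(b, s, self.num_heads, -1).transpose(1, 2)
        v = self.v_proj(h).view(b, s, self.num_heads, -1).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v)
        x = x + self.out_proj(attn.transpose(1, 2).reshape(b, s, d))
        # Token-parameter interaction in place of the usual MLP.
        x = x + self.ffn(self.norm2(x))
        return x
```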
Model Scaling. Consider an existing TokenFormer model equipped with a set of pre-trained key-value parameter tokens; to scale it, we augment this set by appending new key-value parameter tokens.
This scaling scheme permits the integration of an arbitrary number of parameters without altering the input or output dimensions.
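The following sketch illustrates this scheme on the Pattention module defined earlier by appending new key-value parameter tokens to the existing ones. Zero-initializing the new tokens is an assumption made here so that, with GeLU(0) = 0, the scaled layer initially reproduces the original layer's output before training continues; whether τ should also be updated to reflect the new token count is left as a design choice.

```python
@torch.no_grad()
def scale_pattention(layer: Pattention, num_new_tokens: int) -> None:
    """Append new key-value parameter tokens to an existing Pattention layer
    (hypothetical helper; zero init keeps the pre-scaling output unchanged)."""
    device = layer.key_params.device
    dtype = layer.key_params.dtype
    dim_in = layer.key_params.shape[1]
    dim_out = layer.value_params.shape[1]
    new_keys = torch.zeros(num_new_tokens, dim_in, device=device, dtype=dtype)
    new_values = torch.zeros(num_new_tokens, dim_out, device=device, dtype=dtype)
    layer.key_params = nn.Parameter(torch.cat([layer.key_params.data, new_keys]))
    layer.value_params = nn.Parameter(torch.cat([layer.value_params.data, new_values]))
    # tau is kept fixed here; recomputing it from the new token count would
    # slightly perturb the pre-scaling outputs.
```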
Progressive Model Scaling. Our progressive scaling methodology employing Tokenformer achieves performance comparable to that of a Transformer model trained from scratch, while substantially reducing the training budget.
Language Modeling. The table below presents the performance of Tokenformer across various widely recognized zero-shot downstream tasks. Comparisons are drawn against leading open-source Transformer models of equivalent scale. As shown in the table, our model achieves competitive performance compared to the standard Transformer, demonstrating the expressive power of our architecture as a foundation model.
Vision Modeling. The table below validates the expressiveness of our model on visual tasks. We compare our approach against the standard Vision Transformer (ViT) trained with supervised learning on the ImageNet-1K dataset. As shown in the table, our model matches the performance of ViT in visual modeling, confirming its expressiveness on visual tasks.
This paper introduces Tokenformer, a naturally scalable architecture that leverages the attention mechanism to facilitate not only inter-token computations but also interactions between tokens and model parameters, thereby enhancing architectural flexibility. By representing model parameters as tokens, we replace all linear projection layers in the Transformer with our Pattention layers, allowing for seamless and efficient incremental scaling without the need for retraining from scratch. We believe that this architecture, offering greater flexibility than traditional Transformers, will further contribute to the development of foundation models.
@article{wang2024tokenformer,
title={{TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters}},
author={Wang, Haiyang and Fan, Yue and Naeem, Muhammad Ferjad and Xian, Yongqin and Lenssen, Jan Eric and Wang, Liwei and Tombari, Federico and Schiele, Bernt},
journal={arXiv preprint arXiv:2410.23168},
year={2024}
}