TokenFormer

Rethinking Transformer Scaling with Tokenized Model Parameters

TokenFormer is a fully attention-based neural network that unifies the computations of token-token and token-parameter interactions by relying entirely on the attention mechanism, maximizing the flexibility of the network.

It tokenizes not only the data but also the model parameters, replacing the traditional notion of a fixed model with interaction flows between data and parameter tokens. As a result, it can handle a variable number of parameters and offers greater flexibility than traditional Transformers, which we expect to further contribute to the development of foundation models in the following ways:

Incremental Model Scaling: Our model allows for progressive scaling without retraining from scratch by adding new key-value parameter pairs, greatly reducing training costs. This advantage is the primary focus of this paper.
Device-Cloud Collaboration: It can serve as the cloud knowledge base in Device-Cloud Collaboration, with each pair of key-value parameter tokens representing a learnable pattern.
Sparse Inference (MoE): We interpret Tokenformer as an extreme Mixture of Experts (MoE), with each key-value pair acting as an expert, which greatly reduces inference costs.
Parameter-Efficient Tuning: When confronted with new tasks or datasets, the model can augment its pre-trained parameters with a small number of new parameter tokens, thereby adapting quickly to the specific task requirements.
Integrating Vision and Language Models: Enabling seamless vision-language integration by merging key-value parameter tokens from pre-trained vision and language models.
Model Interpretability: Its fully attention-based design naturally enhances model interpretability.

We introduce Tokenformer, a fully attention-based architecture that unifies the computations of token-token and token-parameter interactions by relying entirely on the attention mechanism, maximizing the flexibility of the network. This design allows the model to handle a variable number of parameters, inherently enhancing its scalability and facilitating progressively efficient scaling.

We hope that this architecture, by offering greater flexibility than traditional Transformers, will further contribute to the development of foundation models, sparse inference (MoE), parameter-efficient tuning, device-cloud collaboration, vision-language modeling, model interpretability, and beyond.


Token-Parameter Attention

Although Transformers excel across various domains, their scalability is limited by the high computational overhead resulting from prescribed token-parameter interactions (i.e., linear projections). As a result, scaling strategies that adjust architectural components (e.g., channel dimensions) typically require retraining the entire model from the beginning, leading to inefficient use of computational resources.

To overcome this challenge, we propose TokenFormer, an architecture built entirely on attention mechanisms. The central innovation of TokenFormer is the token-parameter attention (Pattention) layer, which incorporates a set of trainable tokens that function as model parameters and employs cross-attention to manage interactions between the input tokens and these parameter tokens.

Figure 1: TokenFormer is a fully attention-driven architecture featuring a new token-parameter attention (Pattention) layer. The Pattention uses a set of learnable tokens to represent model parameters and lets the input tokens attend to them.

Pattention. To implement our Pattention mechanism, we use the input tokens as queries and introduce two sets of n learnable parameter tokens, K_P and V_P, to represent the keys and values. Given input tokens X, the output of the scaled dot-product Pattention layer is computed as:

$$\mathrm{Pattention}(X, K_P, V_P) = \Theta\!\left(X \cdot K_P^{\top}\right) \cdot V_P$$

where Θ is a modified softmax operation for stable optimization of the Pattention layer. The output Pattention scores S are formulated as:

$$S_{ij} = f\!\left(A_{ij} \times \frac{\tau}{\sqrt{\sum_{k=1}^{n} |A_{ik}|^{2}}}\right), \quad A = X \cdot K_P^{\top}$$

where A is the attention score, τ is the scale factor, and f is a non-linearity function, which in our formulation is set to the GeLU function. This design improves gradient stability in our architecture and results in better performance compared to the standard softmax operation.
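
To make the mechanism concrete, below is a minimal PyTorch sketch of a Pattention layer following the description above. The class name Pattention, the argument num_param_tokens, the initialization scheme, and the choice of τ = √n together with L2 normalization inside the modified softmax are illustrative assumptions; the official TokenFormer implementation may differ in these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Pattention(nn.Module):
    """Token-parameter attention: input tokens attend to learnable key/value parameter tokens."""

    def __init__(self, dim_in: int, dim_out: int, num_param_tokens: int):
        super().__init__()
        # Two sets of n learnable parameter tokens serving as keys and values.
        self.key_params = nn.Parameter(torch.randn(num_param_tokens, dim_in) * 0.02)
        self.value_params = nn.Parameter(torch.randn(num_param_tokens, dim_out) * 0.02)
        self.tau = float(num_param_tokens) ** 0.5  # scale factor (assumed to be sqrt(n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim_in); the input tokens act as queries.
        scores = x @ self.key_params.t()                    # (batch, seq_len, n)
        # Modified softmax: L2-normalize each score row, rescale by tau, apply GeLU.
        norm = scores.norm(dim=-1, keepdim=True).clamp_min(1e-6)
        scores = F.gelu(scores * self.tau / norm)
        return scores @ self.value_params                   # (batch, seq_len, dim_out)
```

Used this way, the layer acts as a drop-in replacement for a linear projection from dim_in to dim_out, while the number of parameter tokens n can be chosen, and later grown, independently of the feature dimensions.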

Overall Architecture. Figure 1 illustrates the architecture of Tokenformer. We follow the standard Transformer and replace all of its linear projections with our Pattention layers. Designing the architecture in this manner represents all fundamental components, including both input data and model parameters, as tokens within the computational framework, thereby establishing a fully attention-based neural network characterized by exceptional flexibility.
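
As a rough sketch of this design (not the reference implementation), the block below reuses the Pattention module sketched earlier and replaces the query, key, value, and output projections as well as the feed-forward sublayer of a standard pre-norm Transformer block with Pattention layers. The layer sizes, the single-Pattention feed-forward, and the use of PyTorch's built-in scaled_dot_product_attention for the token-token attention are assumptions made for illustration.

```python
class TokenFormerBlock(nn.Module):
    """A Transformer-style block in which every linear projection is a Pattention layer."""

    def __init__(self, dim: int, num_heads: int, num_param_tokens: int):
        super().__init__()
        self.num_heads = num_heads
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Token-parameter attention replaces the QKV and output projections ...
        self.q_proj = Pattention(dim, dim, num_param_tokens)
        self.k_proj = Pattention(dim, dim, num_param_tokens)
        self.v_proj = Pattention(dim, dim, num_param_tokens)
        self.out_proj = Pattention(dim, dim, num_param_tokens)
        # ... and the feed-forward network as well.
        self.ffn = Pattention(dim, dim, num_param_tokens)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        h = self.norm1(x)
        q, k, v = self.q_proj(h), self.k_proj(h), self.v_proj(h)
        # Token-token attention between the input tokens themselves (multi-head).
        q = q.view(b, t, self.num_heads, -1).transpose(1, 2)
        k = k.view(b, t, self.num_heads, -1).transpose(1, 2)
        v = v.view(b, t, self.num_heads, -1).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, t, d)
        x = x + self.out_proj(out)
        x = x + self.ffn(self.norm2(x))
        return x
```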

Progressive Model Scaling

Consider an existing Tokenformer model equipped with a set of pre-trained key-value parameter tokens. To scale the model, we augment this set by appending new key-value parameter tokens.

Figure 2: As the model scales, TokenFormer adds new learnable tokens to expand the existing key-value parameter sets, while keeping the feature dimension constant and leaving the rest of the computation unaffected.

This scaling scheme permits the integration of an arbitrary number of parameters without altering the input or output dimension.
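
A hedged sketch of this scaling step, again written against the Pattention module sketched earlier: new key-value parameter tokens are concatenated onto the pre-trained ones, leaving the feature dimensions and the surrounding computation untouched. Zero-initializing the new keys so that the appended pairs start out inactive, and recomputing the scale factor as √n, are assumptions for illustration rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn


def scale_pattention(layer: "Pattention", num_new_tokens: int) -> None:
    """Append new key-value parameter token pairs to an existing Pattention layer."""
    device, dtype = layer.key_params.device, layer.key_params.dtype
    dim_in, dim_out = layer.key_params.shape[1], layer.value_params.shape[1]
    # New keys start at zero, so their scores are zero and GeLU(0) = 0: the new
    # pairs contribute nothing to the output until training updates them.
    new_keys = torch.zeros(num_new_tokens, dim_in, device=device, dtype=dtype)
    new_values = torch.randn(num_new_tokens, dim_out, device=device, dtype=dtype) * 0.02
    layer.key_params = nn.Parameter(torch.cat([layer.key_params.data, new_keys], dim=0))
    layer.value_params = nn.Parameter(torch.cat([layer.value_params.data, new_values], dim=0))
    # How the scale factor should be handled as n grows is left open here; this
    # sketch simply recomputes sqrt(n) for the enlarged set.
    layer.tau = float(layer.key_params.shape[0]) ** 0.5


# Example: grow every Pattention layer of a block before resuming training.
# for module in block.modules():
#     if isinstance(module, Pattention):
#         scale_pattention(module, num_new_tokens=1024)
```

After all Pattention layers have been enlarged in this way, training simply resumes on the expanded parameter set; because the input and output dimensions are unchanged, the rest of the network is unaffected.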


Experiments

Progressive Model Scaling. Our progressive scaling methodology employing Tokenformer achieves performance comparable to that of a Transformer model trained from scratch, while substantially reducing the training budget.

Figure 3: Left: Evaluating model scaling costs through cumulative computational budgets. The Transformer baseline incurs expenses for each individual scaling step performed independently from scratch, whereas Tokenformer aggregates costs across all scaling stages, including training a 124M model initially and progressively scaling to 354M, 757M, and 1.4B parameters. Right: Evaluating model scaling costs by measuring the budget required at each scaling stage. The Transformer baselines are consistent with those in the left plot, trained with 30B and 300B tokens. Similarly, for Tokenformer, the cost is the budget required for each incremental scaling step from the next smaller model. All experiments were conducted on TPU v4 hardware.

Language Modeling. The table below presents the performance of Tokenformer on a range of widely recognized zero-shot downstream tasks. Comparisons are drawn against leading open-source Transformer models of equivalent scale. As shown in the table, our model achieves competitive performance compared to the standard Transformer, demonstrating the expressive power of our architecture as a foundation model.

Table 1: Zero-shot Evaluation: The best performance for each model size is highlighted in bold. Our comparisons are made with publicly available Transformer-based LMs that use various tokenizers. Following Pythia, our model is trained for up to 300B tokens on the Pile dataset.

Vision Modeling. The table below validates the expressiveness of our model on visual tasks. We compare our approach against the standard Vision Transformer (ViT) trained with supervised learning on the ImageNet-1K dataset. As shown in the table, our model matches the performance of ViT, confirming its expressiveness in visual modeling.

Table 2: ImageNet-1K Classification: Comparison with the standard Vision Transformer on ImageNet-1K. The training hyperparameters (batch size, learning rate, etc.) are fully consistent with MAE. † denotes models whose parameter size has been matched to that of the standard ViT.

Conclusion

This paper introduces Tokenformer, a naturally scalable architecture that leverages the attention mechanism to facilitate not only inter-token computations but also interactions between tokens and model parameters, thereby enhancing architectural flexibility. By representing model parameters as tokens, we replace all linear projection layers in the Transformer with our Pattention layers, allowing for seamless and efficient incremental scaling without the need for retraining from scratch. We believe that this architecture, offering greater flexibility than traditional Transformers, will further contribute to the development of foundation models.

BibTeX

@article{wang2024tokenformer,
  title={{TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters}},
  author={Wang, Haiyang and Fan, Yue and Naeem, Muhammad Ferjad and Xian, Yongqin and Lenssen, Jan Eric and Wang, Liwei and Tombari, Federico and Schiele, Bernt},
  journal={arXiv preprint arXiv:2410.23168},
  year={2024}
}