One Wide Feedforward is All You Need

This paper was accepted at the WMT conference at EMNLP.
The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently. In this work, we explore the role of the FFN and find that despite taking up a significant fraction of the model's parameters, it is highly redundant. Concretely, we are able to substantially reduce the number of parameters with only a modest drop in accuracy by…
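To make the FFN's role concrete, below is a minimal PyTorch-style sketch of a position-wise feedforward block, along with an illustrative way to share one FFN instance across layers so its parameters are counted once. The dimensions, layer count, and sharing scheme are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    """Standard Transformer FFN: transforms each token independently."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand
            nn.ReLU(),                  # non-linearity
            nn.Linear(d_ff, d_model),   # project back
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); each position is mapped independently
        return self.net(x)

# Illustrative assumption: a single FFN reused by every layer of a
# hypothetical 6-layer stack, so its parameters are stored only once.
shared_ffn = PositionWiseFFN(d_model=512, d_ff=2048)
layers = [shared_ffn for _ in range(6)]

x = torch.randn(2, 10, 512)
for ffn in layers:
    x = x + ffn(x)  # residual connection; attention and LayerNorm omitted for brevity
```

In this sketch, widening `d_ff` of the single shared block is one way to restore capacity while still using far fewer parameters than six independent FFNs.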