Improving Transformer Models by Reordering their Sublayers
tl;dr – improve transformers by reordering their sublayers like the sandwich transformer
The authors trained random #transformer models with reordered sublayers, and found that some perform better than the baseline interleaved transformer in #language #modeling.
They observed that, on average, better-performing models contain more self-attention #sublayers at the bottom and more feedforward sublayers at the top.
This led them to design a new transformer stack, the sandwich transformer, which consistently improves perplexity over the baseline at no extra parameter or training cost.
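The reordering itself is easy to picture as a string over sublayer types: writing `s` for a self-attention sublayer and `f` for a feedforward sublayer, the baseline interleaved model is `sfsf…sf`. A minimal sketch of the sandwich pattern, assuming the paper's scheme of pulling `k` self-attention sublayers to the bottom and `k` feedforward sublayers to the top (the function name and signature here are illustrative, not from the paper's code):

```python
def sandwich_order(n: int, k: int) -> str:
    """Return the sublayer ordering for a sandwich transformer.

    n: number of (self-attention, feedforward) sublayer pairs in the baseline.
    k: sandwich coefficient, 0 <= k <= n. k = 0 recovers the interleaved baseline.
    """
    # k self-attention sublayers at the bottom, n - k interleaved pairs
    # in the middle, k feedforward sublayers at the top.
    return "s" * k + "sf" * (n - k) + "f" * k

# A 16-pair model with sandwich coefficient 6:
print(sandwich_order(16, 6))
# k = 0 gives the plain interleaved stack:
print(sandwich_order(16, 0))
```

Note that both orderings contain exactly `n` sublayers of each type, so the parameter count is unchanged, which is why the improvement comes "at no cost".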
paper: https://ofir.io/sandwich_transformer.pdf