Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer
tl;dr:
- the largest T5 model has 11 billion parameters;
- encoder-decoder models generally outperformed "decoder-only" language models;
- fill-in-the-blank-style denoising objectives worked best;
- among similar pre-training objectives, the most important differentiating factor was computational cost;
- training on in-domain data can be beneficial, but pre-training on smaller datasets can lead to detrimental overfitting;
- multitask learning can come close to a pre-train-then-fine-tune approach, but requires carefully choosing how often the model is trained on each task.
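The fill-in-the-blank-style denoising objective mentioned above replaces contiguous spans of the input with sentinel tokens and asks the model to reconstruct them. A minimal toy sketch (the helper name, span-selection heuristic, and parameters are illustrative, not the paper's exact sampling procedure):

```python
import random

def span_corrupt(tokens, n_spans=2, span_len=2, seed=0):
    """Toy T5-style span corruption: mask `n_spans` non-overlapping spans
    of length `span_len`, replacing each with a sentinel like <extra_id_0>.
    Returns (corrupted_input, target) token lists."""
    rng = random.Random(seed)
    # Pick non-overlapping span start positions.
    starts, candidates = [], list(range(len(tokens) - span_len + 1))
    while len(starts) < n_spans and candidates:
        s = rng.choice(candidates)
        starts.append(s)
        # Equal-length spans overlap iff their starts are closer than span_len.
        candidates = [c for c in candidates if abs(c - s) >= span_len]
    starts.sort()

    corrupted, target, prev_end = [], [], 0
    for i, s in enumerate(starts):
        sentinel = f"<extra_id_{i}>"
        corrupted += tokens[prev_end:s] + [sentinel]   # drop span, keep a placeholder
        target += [sentinel] + tokens[s:s + span_len]  # target restores the span
        prev_end = s + span_len
    corrupted += tokens[prev_end:]
    target += ["</s>"]
    return corrupted, target
```

For example, `span_corrupt("Thank you for inviting me to your party last week".split())` yields an input with two sentinel placeholders and a target that lists each sentinel followed by the tokens it hid.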
The model can be fine-tuned on smaller labeled datasets, often resulting in (far) better performance than training on the labeled data alone.
The paper presents a large-scale empirical survey to determine which transfer learning techniques work best, and applies these insights at scale to create a new model, the Text-To-Text Transfer Transformer (T5). It also introduces a new open-source pre-training dataset, the Colossal Clean Crawled Corpus (C4).
The T5 model, pre-trained on C4, achieves SOTA results on many NLP benchmarks while being flexible enough to be fine-tuned to a variety of important downstream tasks.
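The flexibility across downstream tasks comes from T5's unified text-to-text format: every task is cast as feeding the model a plain-text input with a task prefix and training it to emit a plain-text target. A minimal sketch of that framing (the prefixes below loosely follow examples from the paper; the helper itself is hypothetical):

```python
def to_text_to_text(task, text, target):
    """Toy illustration of T5's text-to-text framing: any task becomes a
    (prefixed input string, target string) pair for one shared model."""
    prefixes = {
        "translate_en_de": "translate English to German: ",  # translation
        "cola": "cola sentence: ",                           # acceptability judgment
        "summarize": "summarize: ",                          # summarization
    }
    return prefixes[task] + text, target

# Usage: the same function shape covers wildly different tasks.
pair = to_text_to_text("translate_en_de", "That is good.", "Das ist gut.")
```

Because inputs and targets are always strings, fine-tuning on a new labeled dataset only means supplying new (input, target) text pairs, with no task-specific model surgery.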
blog post: https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
paper: https://arxiv.org/abs/1910.10683
github (with pre-trained models): https://github.com/google-research/text-to-text-transfer-transformer
colab notebook: https://colab.research.google.com/github/google-research/text-to-text-transfer-transformer/blob/master/notebooks/t5-trivia.ipynb
#nlp #transformer #t5