T5 (language model)

Text-to-Text Transfer Transformer (T5)
Original author(s)	Google AI
Initial release	23 October 2019
Stable release	T5X github.com/google-research/t5x
Repository	https://github.com/google-research/text-to-text-transfer-transformer
Type	Large language model; Transformer (deep learning architecture)
License	Apache-2.0
Website	blog.research.google/2020/02/exploring-transfer-learning-with-t5.html

T5 (Text-to-Text Transfer Transformer) is a series of large language models developed by Google AI. Introduced in 2019,^[1] T5 models are trained on a large corpus of web text using a text-to-text framework, in which every task is cast as mapping an input string to an output string. The models can perform the text-based tasks they were pre-trained on, and they can be fine-tuned to perform other tasks. They have been employed in various applications, including chatbots, machine translation systems, text summarization tools, code generation, and robotics.

Like the original Transformer model,^[2] T5 models are encoder-decoder Transformers, where the encoder processes the input text, and the decoder generates the output text.

In 2024, EleutherAI released Pile-T5, which trains the same architecture on a different dataset (The Pile).^[3]

Training

T5 models are pre-trained on the Colossal Clean Crawled Corpus (C4), which contains cleaned text scraped from the internet. This pre-training process enables the models to learn general language understanding and generation abilities. T5 models can then be fine-tuned on specific downstream tasks, adapting their knowledge to perform well in various applications.

The T5 models were pre-trained on many tasks, each expressed in the format <input text> -> <output text>.

Some examples are:

restoring corrupted text: Thank you <X> me to your party <Y> week. -> <X> for inviting <Y> last <Z>, where <Z> marks the end of the output.
translation: translate English to German: That is good. -> Das ist gut.
judging the grammatical acceptability of a sentence (CoLA sentence): The course is jumping well. -> not acceptable
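
The first example above can be reproduced with a small sketch. This is an illustration only, not the T5 preprocessing code: the function name `corrupt_spans` is hypothetical, tokens are split on whitespace rather than by a subword tokenizer, and the spans to drop are given by hand rather than sampled at random.

```python
from typing import List, Tuple


def corrupt_spans(
    tokens: List[str],
    spans: List[Tuple[int, int]],
    sentinels: List[str],
) -> Tuple[str, str]:
    """Replace each (start, end) token span with a sentinel.

    Returns (input_text, target_text). Spans are assumed to be sorted,
    non-overlapping, and half-open: tokens[start:end] is dropped.
    """
    if len(sentinels) < len(spans) + 1:
        raise ValueError("need one sentinel per span plus a final one")

    corrupted: List[str] = []
    target: List[str] = []
    cursor = 0
    for sentinel, (start, end) in zip(sentinels, spans):
        corrupted.extend(tokens[cursor:start])  # keep text before the span
        corrupted.append(sentinel)              # mark the gap in the input
        target.append(sentinel)                 # the same marker opens the answer
        target.extend(tokens[start:end])        # followed by the dropped tokens
        cursor = end
    corrupted.extend(tokens[cursor:])           # keep the tail of the sentence
    target.append(sentinels[len(spans)])        # final sentinel ends the output
    return " ".join(corrupted), " ".join(target)


tokens = "Thank you for inviting me to your party last week.".split()
input_text, target_text = corrupt_spans(
    tokens, spans=[(2, 4), (8, 9)], sentinels=["<X>", "<Y>", "<Z>"]
)
print(input_text)   # Thank you <X> me to your party <Y> week.
print(target_text)  # <X> for inviting <Y> last <Z>
```

The target contains only the dropped spans, each introduced by its sentinel, so the output stays much shorter than the original sentence.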

Architecture

The T5 series encompasses several models with varying sizes and capabilities. These models are often distinguished by their parameter count, which indicates the complexity and potential capacity of the model. The original paper^[1] reported the following 5 models:

Model	Parameters	# layers	$d_{model}$	$d_{ff}$	$d_{kv}$	# heads
Small	60M	6	512	2048	64	8
Base	220M	12	768	3072	64	12
Large	770M	24	1024	4096	64	16
3B (XL)	3B	24	1024	16384	128	32
11B (XXL)	11B	24	1024	65536	128	128

In the above table,

# layers: Number of layers in the encoder, which always equals the number of layers in the decoder.
# heads: Number of attention heads in each attention block.
$d_{model}$ : Dimension of the embedding vectors.
$d_{ff}$ : Dimension of the feedforward network within each encoder and decoder layer.
$d_{kv}$ : Dimension of the key and value vectors in each attention head.
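
As a rough check on how these hyperparameters relate to the parameter counts in the table, the sketch below estimates the number of weights in each model. It is a back-of-the-envelope calculation under stated assumptions, not the official count: it assumes no bias terms, ignores layer normalization and position-related parameters, assumes a single shared embedding matrix, and uses a hypothetical default vocabulary of 32,000 entries. The names `T5Config` and `approx_params` are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class T5Config:
    n_layers: int  # layers in the encoder (the decoder has the same number)
    d_model: int   # embedding dimension
    d_ff: int      # feedforward hidden dimension
    d_kv: int      # key/value dimension per attention head
    n_heads: int   # attention heads per attention block


def approx_params(cfg: T5Config, vocab_size: int = 32_000) -> int:
    """Estimate the weight count from the table's hyperparameters.

    vocab_size is an assumption; small terms (norms, position biases)
    are deliberately left out.
    """
    inner = cfg.n_heads * cfg.d_kv        # attention width; need not equal d_model
    attention = 4 * cfg.d_model * inner   # query, key, value, and output projections
    feedforward = 2 * cfg.d_model * cfg.d_ff
    encoder_layer = attention + feedforward
    decoder_layer = 2 * attention + feedforward  # self-attention plus cross-attention
    embedding = vocab_size * cfg.d_model
    return cfg.n_layers * (encoder_layer + decoder_layer) + embedding


CONFIGS = {
    "Small": T5Config(6, 512, 2048, 64, 8),
    "Base": T5Config(12, 768, 3072, 64, 12),
    "Large": T5Config(24, 1024, 4096, 64, 16),
    "3B (XL)": T5Config(24, 1024, 16384, 128, 32),
    "11B (XXL)": T5Config(24, 1024, 65536, 128, 128),
}

for name, cfg in CONFIGS.items():
    print(f"{name}: about {approx_params(cfg) / 1e6:,.0f}M parameters")
```

The estimate makes one feature of the table visible: in the two largest models the attention width (# heads × $d_{kv}$) is much larger than $d_{model}$, so most of the growth comes from wider attention and feedforward layers rather than from more layers. The estimates are approximate and need not match the reported figures exactly.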

Variants

Several subsequent models used the T5 architecture, with non-standardized naming conventions used to differentiate them. This section attempts to collect the main ones. An exhaustive list of the variants released by Google Brain is on the GitHub repo for T5X.^[4]

T5 small, base, large, XL, XXL (2019): The original models.^[1] Note that "XL" and "XXL" were renamed from "3B" and "11B" used in the original paper.^[4]
Switch Transformer (2021): a mixture-of-experts variant of T5 that replaces the feedforward layers in the encoder and decoder blocks with mixture-of-experts feedforward layers.^[5]^[6]
T0 (2021): a model based on T5 trained to perform tasks based only on task instruction (zero-shot).^[7]
Flan-T5-XL (2022): T5 XL but instruction-tuned on the FLAN dataset.^[8]^[9]^[10]^[11]
T5X (2022): an improved JAX-based implementation of the T5 codebase. It is not a model.^[12]
UL2 20B (2022): an encoder-decoder model based on the T5 model, but trained with a "mixture of denoisers" objective on the Colossal Clean Crawled Corpus (C4).^[13]
Flan-UL2 20B (2022): UL2 20B but instruction-tuned on the FLAN dataset.^[13]^[10]
Pile-T5 (2024): T5 trained on The Pile, using the Llama tokenizer. It was released in base, large, XL, and XXL sizes.^[3]

References

[1] Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". Journal of Machine Learning Research. 21 (140): 1–67. ISSN 1533-7928.
[2] Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need". Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
[3] Sutawika, Lintang; Komatsuzaki, Aran; Raffel, Colin (2024-04-15). "Pile-T5". EleutherAI Blog. Retrieved 2024-05-05.
[4] "t5x/docs/models.md at main · google-research/t5x". GitHub. Retrieved 2024-08-05.
[5] Fedus, William; Zoph, Barret; Shazeer, Noam (2022-06-16), Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, doi:10.48550/arXiv.2101.03961, retrieved 2024-08-05
[6] "SwitchTransformers". huggingface.co. Retrieved 2024-08-05.
[7] Sanh, Victor; Webson, Albert; Raffel, Colin; Bach, Stephen H.; Sutawika, Lintang; Alyafeai, Zaid; Chaffin, Antoine; Stiegler, Arnaud; Scao, Teven Le (2022-03-17), Multitask Prompted Training Enables Zero-Shot Task Generalization, doi:10.48550/arXiv.2110.08207, retrieved 2024-08-05
[8] Chung, Hyung Won; Hou, Le; Longpre, Shayne; Zoph, Barret; Tay, Yi; Fedus, William; Li, Yunxuan; Wang, Xuezhi; Dehghani, Mostafa; Brahma, Siddhartha; Webson, Albert; Gu, Shixiang Shane; Dai, Zhuyun; Suzgun, Mirac; Chen, Xinyun (2024). "Scaling Instruction-Finetuned Language Models". Journal of Machine Learning Research. 25 (70): 1–53. ISSN 1533-7928.
[9] Longpre, Shayne; Hou, Le; Vu, Tu; Webson, Albert; Chung, Hyung Won; Tay, Yi; Zhou, Denny; Le, Quoc V.; Zoph, Barret; Wei, Jason; Roberts, Adam (2023-07-03). "The Flan Collection: Designing Data and Methods for Effective Instruction Tuning". Proceedings of the 40th International Conference on Machine Learning. PMLR: 22631–22648.
[10] google-research/FLAN, Google Research, 2024-08-03, retrieved 2024-08-05
[11] "google/flan-t5-xl · Hugging Face". huggingface.co. 2024-01-04. Retrieved 2024-08-05.
[12] Roberts, Adam; Chung, Hyung Won; Mishra, Gaurav; Levskaya, Anselm; Bradbury, James; Andor, Daniel; Narang, Sharan; Lester, Brian; Gaffney, Colin; Mohiuddin, Afroz; Hawthorne, Curtis; Lewkowycz, Aitor; Salcianu, Alex; Zee, Marc van; Austin, Jacob (2023). "Scaling Up Models and Data with t5x and seqio". Journal of Machine Learning Research. 24 (377): 1–8. ISSN 1533-7928.
[13] Tay, Yi; Dehghani, Mostafa; Tran, Vinh Q.; Garcia, Xavier; Wei, Jason; Wang, Xuezhi; Chung, Hyung Won; Shakeri, Siamak; Bahri, Dara (2023-02-28), UL2: Unifying Language Learning Paradigms, doi:10.48550/arXiv.2205.05131, retrieved 2024-08-05
