Exploring the T5 Model: Text-to-Text Transfer Transformer

Written by prakhar21 | Published 2020/05/23
Tech Story Tags: t5 | machine-learning | transformers | text-transformers | pytorch | deep-learning | nlp | applications-of-nlp

TLDR T5 (Text-to-Text Transfer Transformer) reframes every NLP task into a unified text-to-text format, letting a single model pre-trained on the C4 corpus be fine-tuned for translation, summarization, sentence similarity, and more. This post walks through how T5 was pre-trained and how to run inference with it.

Recent years have seen a plethora of pre-trained models such as ULMFiT, BERT, GPT, etc. being open-sourced to the NLP community. Given the size of these humongous models, it's nearly impossible to train such networks from scratch considering the amount of data and computation required. This is where a new learning paradigm, "transfer learning," kicks in. Transfer learning is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem.
The idea is to use pre-trained network weights and fine-tune them for the specific task at hand. Since we wish to reuse these weights, the network must first be trained on a very large, high-quality corpus so that it learns language structure, grammar, and semantics. Most existing models such as ULMFiT and GPT were pre-trained with a language model objective on Wikipedia and the Google News dataset, whereas BERT was trained with an MLM (Masked Language Model) objective.
Later in this post, we will see what MLM is and how T5 is trained with a similar objective, with small tweaks for generalizability. Just to make sure everyone is on the same page: a language model is a machine learning model that looks at the part of a sentence seen so far and predicts the next word.
T5, the Text-to-Text Transfer Transformer, proposes reframing all NLP tasks into a unified text-to-text format where the input and output are always text strings. This formatting makes a single T5 model fit for multiple tasks. As can be seen in the featured animation, it takes in text input on the left for various NLP tasks and outputs the text for the respective task. We will see more about how the model was trained in the sections below.
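To make the unified format concrete, here is a small illustrative list of input/output pairs in Python. These paraphrase the kind of examples shown in the paper and are not taken from any dataset verbatim; note how the task is selected purely by the prefix prepended to the input string.

# Illustrative (input, target) pairs in T5's unified text-to-text format.
# The leading prefix alone tells the model which task to perform.
examples = [
    # machine translation
    ("translate English to German: That is good.", "Das ist gut."),
    # linguistic acceptability (CoLA)
    ("cola sentence: The course is jumping well.", "not acceptable"),
    # abstractive summarization
    ("summarize: state authorities dispatched emergency crews tuesday to survey the damage ...",
     "six people hospitalized after a storm in attala county."),
    # sentence similarity: a regression task, still emitted as a text string
    ("stsb sentence1: The rhino grazed on the grass. sentence2: A rhino is grazing in a field.",
     "3.8"),
]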
Before that, I wanted to discuss the data used to pre-train the model. The authors named it C4 (Colossal Clean Crawled Corpus). It's approximately 750GB in size and is a cleaned version of the Common Crawl dataset. The authors describe the cleaning as extracting only English text, removing code lines, deduplicating, and so on. It's a high-quality, pre-processed English-language corpus that they have made available for download.
Also, the T5 model, pre-trained on C4, achieves state-of-the-art results on many NLP benchmarks while being flexible enough to be fine-tuned to a variety of important downstream tasks.

Training Objective

T5 trains with the same objective as BERT, the Masked Language Model, with a small modification. Masked language models are bidirectional models: at any time t, the representation of a word is derived from both its left and right context. The subtle difference in T5 is that it replaces multiple consecutive tokens with a single mask keyword, unlike BERT, which uses one mask token per masked word.
As you can see from the above diagram, the original text is transformed into input and output pairs by adding perturbations to it. Since the final objective is a model that takes text in and produces text out, the targets are designed as a sequence, unlike BERT, which predicts each masked word individually through a final feed-forward and softmax layer.
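A rough sketch of this span-corruption idea is shown below. It assumes the sentinel-token naming (<extra_id_0>, <extra_id_1>, ...) used in the open-source T5 checkpoints (the paper writes them as <X>, <Y>, <Z>); this is illustrative and not the authors' actual preprocessing code.

def span_corrupt(tokens, masked_positions):
    """Toy sketch of T5-style span corruption.

    Consecutive masked tokens collapse into a single sentinel in the input;
    the target lists each sentinel followed by the tokens it replaced.
    """
    inp, tgt, sentinel = [], [], -1
    prev_masked = False
    for i, tok in enumerate(tokens):
        if i in masked_positions:
            if not prev_masked:                  # start of a new span -> new sentinel
                sentinel += 1
                inp.append(f"<extra_id_{sentinel}>")
                tgt.append(f"<extra_id_{sentinel}>")
            tgt.append(tok)                      # dropped token goes to the target
            prev_masked = True
        else:
            inp.append(tok)
            prev_masked = False
    tgt.append(f"<extra_id_{sentinel + 1}>")     # final sentinel marks end of targets
    return " ".join(inp), " ".join(tgt)

tokens = "Thank you for inviting me to your party last week .".split()
print(span_corrupt(tokens, masked_positions={2, 3, 8}))
# -> ('Thank you <extra_id_0> me to your party <extra_id_1> week .',
#     '<extra_id_0> for inviting <extra_id_1> last <extra_id_2>')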
The model was pre-trained on the C4 corpus (mentioned above) with this objective. It was then fine-tuned on various tasks such as language translation, summarization, sentence similarity, etc. Fine-tuning was done by showing the model I/O text pairs with a task-specific prefix added to each input, for example "translate English to German: <text>". Adding such a prefix lets the model tune its weights for the particular task at hand, so it produces only the output expected for that task, narrowing its scope of generation.
All the tasks essentially share the same objective, training procedure, and decoding process. The authors also claim that they did not find a single case where the model got confused and output something totally random, or the expected output of another task. One quite interesting point is that they modeled even regression tasks such as sentence similarity as text generation.
To reduce the space of real numbers, they generated a number between 0 and 5 quantized in increments of 0.2, which means the model could only produce numbers 0.2 apart, for example 3.2, 3.4, 3.6, etc. Training specifics such as the LR schedule, tokenization, sequence length, etc. can be read in detail in Section 3.1.2 (Training) of the paper.
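As a quick illustration of that quantization step (a sketch of the rounding described above, not the authors' code):

def quantize_sts_target(score):
    """Round an STS-B similarity score in [0, 5] to the nearest 0.2 increment
    and render it as the literal string the model learns to generate."""
    return f"{round(score / 0.2) * 0.2:.1f}"

print(quantize_sts_target(2.57))  # '2.6'
print(quantize_sts_target(3.44))  # '3.4'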
The authors conducted extensive hyper-parameter tuning and testing across various tasks. The diagram below shows the tuning at various levels:
1. Pre-training Style - They tried a typical auto-regressive language modeling objective, a BERT-style Masked Language Model objective, and a deshuffling denoising objective. They found the BERT style (missing-context prediction) to be the best bet for pre-training the model.
2. Corruption Scheme - They experimented with three corruption strategies: masking a random word, masking a span (more than one consecutive word), and dropping a word from the input. Given the task type at hand, where both input and output are text strings, corrupting spans worked best.
3. Corruption Rate - After experimenting with different corruption rates, they found all of them performed almost the same, with 15% doing slightly better.
4. Corruption Length - They also experimented with different corruption span lengths and found that the longer the span, the worse the model performed. This makes sense: a span as long as the whole sentence would mean the model has to produce the text from an empty input, allowing very high variability.
I would also encourage the reader to read the Reflection section (page 32 of the paper) to understand the takeaways from training the model.

Demo

This section will focus on doing inference with the pre-trained T5 model. All the code has been committed to Github: Text-to-Text-Transfer-Transformer. Feel free to clone it and play around. Also, don't forget to star the repo if you like it.
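As a taste of what the repo does, here is a minimal inference sketch, assuming the Hugging Face transformers library and the public t5-base checkpoint (the exact code in the linked repo may differ):

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the publicly released t5-base checkpoint.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# The task-specific prefix ("summarize:") selects the task, as discussed above.
text = ("summarize: state authorities dispatched emergency crews tuesday to survey "
        "the damage after an onslaught of severe weather in mississippi ...")

input_ids = tokenizer.encode(text, return_tensors="pt")
output_ids = model.generate(input_ids, max_length=50, num_beams=4, early_stopping=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))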

References
  1. Google AI Blog: Exploring Transfer Learning with T5
  2. Raffel et al., "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" - https://arxiv.org/pdf/1910.10683.pdf
Thanks for your time.
