PET: Exploiting Patterns Makes Even Small Language Models Few-Shot Learners

Abhilash Nandy
9 min read · Apr 18, 2021

For a while now, huge Transformer-based Language Models (LMs) with billions of parameters, pre-trained on massive unlabelled text corpora, have performed really well on a variety of Natural Language Processing (NLP) tasks. GPT-3 took the internet by storm when it handled a large variety of downstream NLP tasks with zero fine-tuning, using only a few labeled examples given as a prompt (32 examples are enough for it to reach competitive results on several SuperGLUE benchmark tasks!), thus giving rise to a new field: prompt engineering. (NOTE: I am considering only GPT-3 as a representative of huge language models in this article.)

However, there are five major drawbacks to GPT-3 -

  1. Pre-training such huge transformer models requires lots and lots and lots of unlabelled data and large GPU clusters! Scraping such volumes of data and having access to that much compute is not everyone's cup of tea. To quote an article from Lambda: “GPT-3 175B model required 3.14E23 FLOPS of computing for training. Even at theoretical 28 TFLOPS for V100 and lowest 3 year reserved cloud pricing we could find, this will take 355 GPU-years and cost $4.6M for a single training run. Similarly, a single RTX 8000, assuming 15 TFLOPS, would take 665 years to run.” GPT-3 clearly took a fortune to train! (A quick sanity check of the GPU-years figure follows this list.)
  2. Let’s say, by some magic, you have such resources and compute, and can train such huge models. The problem is that this much compute consumes an enormous amount of energy, with a correspondingly large carbon footprint. The footprint is described in this article from The Register: “..teaching the neural super-network (referring to GPT-3) in a Microsoft data center using Nvidia GPUs required roughly 190,000 kWh, which using the average carbon intensity of America would have produced 85,000 kg of CO2 equivalents, the same amount produced by a new car in Europe driving 700,000 km, or 435,000 miles, which is about twice the distance between Earth and the Moon, some 480,000 miles.” Damn!
  3. Let’s say you just want the LM for inference. Such huge proprietary models are not just lying around for anyone to use, due to commercial reasons. That said, efforts have been made to build similar open-source models: EleutherAI has released GPT-Neo, two GPT-2/GPT-3-style models (with 1.3B and 2.7B parameters), and is presently working on scaling the architecture up to hundreds of billions of parameters with GPT-NeoX.
  4. Since the examples are given as a text prompt, only a limited number of examples can be provided (GPT-3's maximum sequence length is 2,048 tokens). This is because the maximum number of input tokens is pre-defined by the architecture. Also, increasing the number of input tokens makes self-attention much more expensive, due to its quadratic dependence on the sequence length.
  5. Deploying models trained on such large, unmonitored, and unmoderated data could lead to a plethora of fairness and bias concerns. As this article on the fairness studies conducted by OpenAI researchers on GPT-3 describes, a lot of gender-based, racial, and religion-based biases creep in from the data it has been trained on.
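
As a quick sanity check on the numbers quoted in point 1, the 355 GPU-years figure can be reproduced with a couple of lines of arithmetic (the FLOP count and the 28 TFLOPS throughput are the figures quoted from the Lambda article, not measurements of mine):

```python
# Back-of-the-envelope check of the "355 GPU-years" figure quoted above.
total_train_flops = 3.14e23     # training compute for GPT-3 175B, as quoted from Lambda
v100_flops_per_sec = 28e12      # theoretical V100 throughput, as quoted from Lambda
seconds_per_year = 365.25 * 24 * 3600

gpu_years = total_train_flops / (v100_flops_per_sec * seconds_per_year)
print(f"~{gpu_years:.0f} V100 GPU-years")   # prints ~355
```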

Enough rant on GPT-3. It’s MEME TIME :)

Go for small but robust models!
Original Content made using: https://imgflip.com/memegenerator

Hence, keeping these drawbacks in mind, we would like to ease our way into using comparatively smaller pre-trained models like BERT, RoBERTa, ALBERT, and so on. However, these models are not nearly as good as the huge GPT-3 model at few-shot learning, where the available task-specific labeled data is scarce. For instance, if we compare GPT-3 with its smaller variants, the following figure shows that few-shot performance increases with the number of model parameters.

Variation of model accuracy with model size (taken from GPT-3 paper)

So, how can we match the sample efficiency of a huge GPT-3 model without having to use such a large model pre-trained on huge amounts of data? Well, the answer lies in exploiting patterns, as suggested in the following lineage of papers: PET & iPET, their multi-token variants, and ADAPET. We shall have an overview of these papers in the coming sections.

PET & iPET

PET stands for “Pattern-Exploiting Training”. The term comes from the paper “Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference” by Schick et al. The motivation behind exploiting “patterns” for few-shot learning is explained with a very simple example in the paper (I will improvise on the example just to add a pinch of creativity :P ). Let us consider three pieces of text:

T1: Damn, ’tis the best pizza in the town bro!

T2: Pay half the bucks, and you get much better sushi I tell ya

T3: Pizza was so so. Not worth the fat wad of cash!

If we are told next to nothing about the task (the only thing we know is that it is a binary classification problem), and are asked to predict the label for T3 given that T1 has the label L and T2 has the label L`, we could use contradictory arguments to arrive at different conclusions about T3's label. However, if we are told that the task is about whether a sentence says something about prices (i.e., if we have a task description), we can confidently assign the label L` to T3. In other words, providing the task description as an auxiliary training signal helps the model predict unknown test labels, even when labeled training data is scarce.

This paper uses “a semi-supervised training procedure that uses natural language patterns to reformulate input examples into cloze-style phrases.” (quoted from the paper). Breaking it down, it broadly does the following three things -

  1. Each sample from the (small) training set T is converted into a variety of cloze questions (a single token is masked, and the model is expected to predict the original token) based on a set of patterns. These cloze questions are designed to provide some sort of task description. For instance, consider the sentence entailment task and the input pair (I like cheese, I hate cheese). One pattern could be “A? [MASK] B”, whose corresponding cloze question would be “I like cheese? [MASK] I hate cheese”. The masked token is essentially the label of the input (here 1/0) mapped to a token from the vocabulary (here, Yes/No); this mapping is done by a function known as the verbalizer. For each such pattern, a pre-trained language model is fine-tuned on the limited training samples, so we get as many fine-tuned language models as there are patterns. Each language model is fine-tuned using a weighted sum of the cross-entropy loss for predicting the masked (verbalized label) token and an auxiliary masked language modeling loss, where some other tokens are masked and predicted instead; the auxiliary loss helps cope with the small number of training samples. (A minimal code sketch of this step follows the list.)
  2. Next, we take a large amount of unlabelled data D, with samples in the same format as those in T (except for the absence of labels). We build an ensemble of the various fine-tuned models (a normalized, weighted sum of their individual output scores) and use it to predict a probability distribution over all labels for each unlabelled sample. This is how the unlabelled data gets annotated.
  3. Treating the ensemble as a “teacher” model, a “student” classifier is trained on the resulting soft-labeled dataset, i.e., the soft-labeled data obtained in the previous step is used to train a standard classifier. This “knowledge distillation” step lets us combine the various individually unstable fine-tuned LMs into a single model that makes robust predictions. (A sketch of steps 2 and 3 follows the diagram below.)
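
To make step 1 concrete, here is a minimal sketch (not the authors' code) of how a pattern-verbalizer pair and the loss at the masked position could look, assuming a RoBERTa masked LM from Hugging Face transformers; the pattern, the Yes/No verbalizer, and the helper names are illustrative choices for the entailment example above:

```python
# A minimal sketch of step 1 (pattern + verbalizer), assuming a RoBERTa MLM from
# Hugging Face transformers. The pattern and the Yes/No verbalizer are illustrative.
import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

def pattern(text_a, text_b):
    """Pattern: reformulate an input pair as a cloze question with one mask."""
    return f"{text_a}? {tokenizer.mask_token} {text_b}"

# Verbalizer: map each label to a single vocabulary token.
VERBALIZER = {1: " Yes", 0: " No"}   # leading space matters for RoBERTa's BPE
label_token_ids = {y: tokenizer.encode(tok, add_special_tokens=False)[0]
                   for y, tok in VERBALIZER.items()}

def label_cross_entropy(text_a, text_b, label):
    """Cross-entropy over the verbalizer tokens at the masked position."""
    enc = tokenizer(pattern(text_a, text_b), return_tensors="pt")
    logits = model(**enc).logits                                   # (1, seq_len, vocab)
    mask_pos = (enc.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    mask_logits = logits[0, mask_pos]                              # (vocab,)
    scores = torch.stack([mask_logits[label_token_ids[y]]
                          for y in sorted(label_token_ids)])       # one score per label
    return torch.nn.functional.cross_entropy(scores.unsqueeze(0), torch.tensor([label]))

# PET adds an auxiliary MLM loss on other, randomly masked tokens and combines the two
# as (1 - alpha) * label_loss + alpha * mlm_loss with a small alpha, to cope with the
# tiny training set; that term is omitted here for brevity.
loss = label_cross_entropy("I like cheese", "I hate cheese", label=0)
loss.backward()
```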

This process is succinctly demonstrated in the following diagram from the paper -

From https://arxiv.org/abs/2001.07676
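
To make steps 2 and 3 equally concrete, here is a hedged sketch of the soft-labeling and distillation stages. The pattern-specific models are assumed to be callables that return un-normalized label scores for a text, and the weights, temperature, and function names are illustrative rather than taken from the authors' code:

```python
# A rough sketch of steps 2 and 3: soft-label unlabeled data with the ensemble of
# pattern-specific models, then train a student classifier on those soft labels.
import torch
import torch.nn.functional as F

def soft_label(pattern_models, weights, text, temperature=2.0):
    """pattern_models: callables mapping a text to un-normalized label scores,
       one per pattern (step 2: weighted ensemble -> soft label distribution)."""
    scores = sum(w * m(text) for m, w in zip(pattern_models, weights)) / sum(weights)
    return F.softmax(scores / temperature, dim=-1)

def distillation_loss(student_logits, teacher_soft_label):
    """Step 3: cross-entropy of the student classifier against the soft labels."""
    return -(teacher_soft_label * F.log_softmax(student_logits, dim=-1)).sum()

# Toy usage with two fake "pattern models" over a 2-label task.
models = [lambda t: torch.tensor([2.0, 0.5]), lambda t: torch.tensor([1.0, 1.5])]
q = soft_label(models, weights=[1.0, 1.0], text="Pizza was so so.")
loss = distillation_loss(torch.zeros(2, requires_grad=True), q)
loss.backward()
```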

iPET is an iterative variant of PET, in which the training set size keeps increasing with each iteration: data labeled in previous iterations is used for training in the current one. This is similar to self-training and bootstrapping approaches and, intuitively, improves on PET.
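
For intuition, a rough sketch of the iPET loop could look like the following; `finetune` and `pseudo_label` are assumed helper callables (not part of any library), and the subset size and growth factor are illustrative rather than the paper's exact hyperparameters:

```python
# A rough sketch of the iPET idea: each generation of pattern-specific models labels
# more unlabeled data, which the next generation then trains on.
import random

def ipet(labeled, unlabeled, patterns, finetune, pseudo_label, generations=3, growth=5):
    """Assumed helpers:
       finetune(pattern, examples)        -> a model fine-tuned for that pattern
       pseudo_label(models, unlabeled, n) -> n examples labeled by those models"""
    train_sets = {p: list(labeled) for p in patterns}
    models = {}
    for g in range(generations):
        models = {p: finetune(p, train_sets[p]) for p in patterns}
        for p in patterns:
            # The next training set for pattern p is labeled by a random subset of the
            # models trained on the *other* patterns, and grows each generation.
            teachers = random.sample([m for q, m in models.items() if q != p],
                                     k=max(1, (len(patterns) - 1) // 2))
            n_new = len(labeled) * growth ** (g + 1) - len(labeled)
            train_sets[p] = list(labeled) + pseudo_label(teachers, unlabeled, n_new)
    return models
```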

Drawbacks —

  1. Only a single token can be masked per pattern, so labels must map to single-token verbalizations
  2. Need a lot of task-specific unlabeled data

Multi-Token Variants

The verbalizer in the original version of PET maps each output to a single token, and the output space is fixed across inputs. These limitations are lifted in the improved version, where multiple tokens are predicted in an auto-regressive fashion. With multi-token outputs, a plain cross-entropy over all candidate outputs becomes impractical to compute, so the authors instead use a multi-class hinge loss, minimizing -

From https://arxiv.org/pdf/2009.07118.pdf

This loss function tries to keep the separation between the log probability of the ground-truth output y and that of any other output y` at least 1.
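
Here is a minimal sketch of that multi-class hinge loss, assuming we already have (approximate) log probabilities for every candidate output of a given input (in the paper these come from the auto-regressive decoding described above):

```python
# Multi-class hinge loss: penalize any candidate output whose log probability comes
# within a margin of 1 of the gold output's log probability.
import torch

def multiclass_hinge_loss(log_probs: torch.Tensor, gold: int) -> torch.Tensor:
    """log_probs: (num_candidates,) tensor of log q(y'|x) for every candidate output."""
    others = torch.cat([log_probs[:gold], log_probs[gold + 1:]])
    margins = 1.0 - log_probs[gold] + others
    return torch.clamp(margins, min=0.0).sum()

# Toy usage: three candidate outputs, the gold output is index 0.
loss = multiclass_hinge_loss(torch.tensor([-0.2, -1.5, -0.4]), gold=0)   # tensor(0.8)
```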

Drawback(s)-

  1. Still need a lot of task-specific unlabeled data

ADAPET to the rescue!

To remove the drawback of needing lots and lots of unlabeled task-specific data (which, to be honest, is difficult to get), ADAPET (A Densely-supervised Approach to PET) was introduced. While PET and iPET used roughly 9k additional unlabeled examples on top of the 32 labeled examples, ADAPET uses only the 32 labeled examples, about 0.3% of the data consumed by PET/iPET. Even so, ADAPET performs better than both PET and iPET.

The following improvements in the training scheme put ADAPET ahead of its ancestors, both in sample efficiency and in performance -

  1. Instead of just maximizing the probability of the correct label token (in the MLM objective) as in PET and iPET, the authors also minimize the probabilities of the incorrect label tokens, with the probabilities normalized over the full vocabulary, so the model learns to push down incorrect classes and not just other vocabulary tokens. They sum a binary cross-entropy loss across all classes (1 correct and the rest incorrect) and minimize it. (A sketch of this objective follows the list.)
  2. Random tokens from the input are masked, and the model is trained to predict them conditioned on the label inserted into the pattern: training is performed when the inserted label is correct, but not otherwise.
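
As a sketch of the first objective, a decoupled label loss could look like the following; here `mask_logits` stands for the MLM-head logits at the masked position and `label_token_ids` for the verbalizer's token ids, both placeholders rather than the authors' actual variable names:

```python
# Decoupled label objective (sketch): normalize over the full vocabulary, then apply a
# binary cross-entropy with target 1 for the correct label token and 0 for the others.
import torch
import torch.nn.functional as F

def decoupled_label_loss(mask_logits, label_token_ids, gold_index):
    probs = F.softmax(mask_logits, dim=-1)       # distribution over the whole vocabulary
    loss = mask_logits.new_zeros(())
    for i, tok_id in enumerate(label_token_ids):
        target = torch.tensor(1.0 if i == gold_index else 0.0)
        loss = loss + F.binary_cross_entropy(probs[tok_id], target)
    return loss

# Toy usage: a fake 10-token vocabulary where ids 3 ("Yes") and 7 ("No") are the
# verbalizer tokens and the gold label is "Yes".
loss = decoupled_label_loss(torch.randn(10, requires_grad=True), [3, 7], gold_index=0)
loss.backward()
```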

The above two points are summarized in the following figure from the paper -

From https://arxiv.org/abs/2103.11955

Finally, this figure from the ADAPET paper sums up how the three approaches perform on SuperGLUE (along with their comparison to GPT-3) -

From https://arxiv.org/pdf/2103.11955.pdf

What is the way ahead?

The aforementioned papers have done tremendous work in bringing down the number of labeled samples the network needs during training, thus making few-shot learning possible with small models such as ALBERT (by small, I mean compared to GPT-3 😛). However, zero-shot learning and domain adaptation with very few training examples still remain largely unsolved. Also, if we are to take steps towards Artificial General Intelligence, mastery of zero-shot learning will come in handy. I look forward to seeing papers that tackle these problems in NLP while keeping both the training data size and the model size in check.

Acknowledgments

This was written as a part of the NLP course at IIT Kharagpur. Making blog writing a part of the extra credits really encouraged me to explore research papers and understand a few selected ones in a good amount of depth.

This article would not have been possible without the YouTube video (click here) that explains the concepts in the aforementioned papers in such a simple manner.

References

  1. Schick, T., & Schütze, H. (2020). Exploiting cloze questions for few-shot text classification and natural language inference. arXiv preprint arXiv:2001.07676. https://arxiv.org/abs/2001.07676
  2. Schick, T., & Schütze, H. (2020). It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners. arXiv preprint arXiv:2009.07118. https://arxiv.org/abs/2009.07118
  3. Tam, D., Menon, R. R., Bansal, M., Srivastava, S., & Raffel, C. (2021). Improving and Simplifying Pattern Exploiting Training. arXiv preprint arXiv:2103.11955. https://arxiv.org/abs/2103.11955
  4. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., … & Amodei, D. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165. https://arxiv.org/abs/2005.14165
