Papers we've covered
GPT-2

Language Models are Unsupervised Multitask Learners

Utilizing GPT-2, this paper matched several state-of-the-art models across various benchmarks without custom architectures or dataset-specific adjustments, relying solely on ample model capacity and data.

Original Paper: https://openai.com/research/better-language-models

Introduction

Language models can solve specialized language tasks without task-specific architectures or fine-tuning, given enough parameters and pre-training data. With large, high-quality text data, these models can tackle challenges such as question answering, translation, and summarization. The more diverse a dataset is, the more likely it is to contain examples of the specialized task that the model can learn from.

For instance, a diverse dataset would contain naturally occurring examples of translation, such as "The translation of the French sentence 'Es-tu allé au cinéma ?' to English is 'Did you go to the cinema?'". This allows the model to generalize to these tasks without task-specific supervised fine-tuning.

This paper uses GPT-2 to illustrate this point. GPT-2 is trained on a large corpus of text, where it tries to predict the next token in the sequence. This is a self-supervised process, since it is done without manual labels from human annotators. During inference, GPT-2 generates text sequentially from an initial prompt, with each predicted token being fed back into the model to predict future tokens. As a result, it is known as an autoregressive model.
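To make the autoregressive loop concrete, here is a minimal greedy-decoding sketch. It assumes the Hugging Face transformers and torch packages and the public "gpt2" checkpoint; it is an illustration, not the original OpenAI code.

```python
# A minimal sketch of autoregressive (token-by-token) generation with GPT-2.
# Assumes `transformers` and `torch` are installed; greedy decoding for simplicity.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("Did you go to the", return_tensors="pt")

with torch.no_grad():
    for _ in range(20):                                # generate 20 new tokens
        logits = model(input_ids).logits               # (batch, seq_len, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1)      # greedy: most likely next token
        # feed the prediction back in as part of the context
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```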

Ultimately, GPT-2's improvements over GPT-1 were attributed to a larger parameter count, a new unlabeled corpus (WebText), and the removal of task-specific fine-tuning.

From GPT-1 to GPT-2

In Natural Language Understanding (NLU), there is a wide range of tasks, such as textual entailment, question answering, semantic similarity assessment, and document classification. These tasks are inherently supervised, and given the scarcity of labeled data, discriminative models such as Bidirectional Long Short-Term Memory (Bi-LSTM) networks tend to underperform on them1.

In the GPT-1 paper Improving Language Understanding by Generative Pre-Training, the authors demonstrated that generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task, can overcome the constraints imposed by the small amount of annotated data for these tasks. The process is collectively termed semi-supervised learning, and the goal is to learn a universal representation of the natural language space that can be used across a wide range of tasks.

The pretraining objective is to predict the next token in a sequence, in an autoregressive manner, given the previous tokens. The pretrained model, often known as the foundational model (or backbone), serves as a base from which specialized capabilities can be added through fine-tuning on specific tasks. In the fine-tuning phase, task-specific adaptations are necessary: the input format must be adjusted to align with the particular requirements of the task at hand, and the model's final layer (or "head") needs to be replaced to accommodate the task's specific class structure. The authors showed that this approach yielded state-of-the-art results on a wide range of NLU tasks.
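A minimal sketch of the two phases in PyTorch: next-token pretraining (cross-entropy on shifted targets), followed by swapping the language-modeling head for a task-specific classification head. The `backbone` here is a hypothetical stand-in for a pretrained Transformer, not the papers' actual code.

```python
# Sketch of (1) the next-token pretraining objective and (2) head replacement
# for fine-tuning. `backbone` is a stand-in module mapping token ids
# (batch, seq_len) to hidden states (batch, seq_len, d_model).
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size, num_classes = 768, 50257, 2

lm_head = nn.Linear(d_model, vocab_size)         # pretraining (language-modeling) head

def pretrain_loss(backbone, tokens):
    hidden = backbone(tokens)                    # (B, T, d_model)
    logits = lm_head(hidden)                     # (B, T, vocab)
    # predict token t+1 from positions <= t: shift logits vs. targets by one
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab_size),
        tokens[:, 1:].reshape(-1),
    )

cls_head = nn.Linear(d_model, num_classes)       # replaces lm_head for fine-tuning

def finetune_loss(backbone, tokens, labels):
    hidden = backbone(tokens)
    logits = cls_head(hidden[:, -1])             # classify from the final position
    return F.cross_entropy(logits, labels)
```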

Notwithstanding the success of this approach, the same set of authors published a new paper the following year, titled Language Models are Unsupervised Multitask Learners, in which they introduced a new model, GPT-2, that was larger in capacity and trained on a much larger unlabeled corpus, WebText. The key innovation, however, was to drop the supervised fine-tuning step: they demonstrated that GPT-2 could be used directly on a wide range of NLU tasks via what they termed zero-shot transfer. The motivation is that the authors believe foundational language models should be competent generalists rather than narrow experts2. They call for a shift in the language modeling paradigm toward models generic enough to handle NLU tasks without curating task-specific training data for each one.

GPT-2 Paper Key Ideas

In this section, we review the key ideas from the GPT-2 paper.

Abstract Overview

Below are the key ideas from the abstract of the GPT-2 paper:

  • All previous pretrained language models necessitated a secondary stage of supervised fine-tuning to tailor them to specific downstream tasks.
  • The authors showcased that, given sufficient model capacity and data, language models can be adeptly adjusted to a broad spectrum of tasks without the need for task-specific architectural modifications.
  • When conditioned on a document and questions in a question-answering setting using the CoQA dataset (which contains 127,000+ training examples), the model matches or exceeds the performance of 3 out of 4 baseline systems without using those training examples.
  • An emphasis is placed on the model's capacity as integral to the success of zero-shot transfer. Performance is reported to increase log-linearly with the number of parameters, i.e., performance improves roughly linearly in the logarithm of model capacity, so each multiplicative increase in parameters yields a roughly constant gain.

Introduction

In this section, we discuss the key ideas from the introduction of the GPT-2 paper.

Key 1. Competent Generalists over Narrow Experts (1)

  • The authors cited other works that have demonstrated significant success of machine learning systems through a combination of large-scale data, high-capacity models, and supervised fine-tuning.
  • However, such systems, termed as "narrow experts," are fragile, as they are highly dependent on the specific training regime and task. A slight perturbation to the input distribution can cause the model to perform poorly.
  • The authors then expressed the desire for "competent generalists" that can perform well across a wide range of tasks without the need for task-specific architectures or supervised fine-tuning.

Key 2. IID Assumption Fails in Real World (2, 3)

  • The overarching goal in machine learning is to generalize to unseen data points. To simplify the modeling of machine learning objectives, it is commonly assumed that the training and test data are drawn from the same distribution, a concept known as the Independent and Identically Distributed (i.i.d.) assumption.
    • As an aside, the i.i.d. assumption is foundational in statistical modeling because it simplifies the analysis significantly. For example, it allows us to express joint probability distributions as the product of marginal distributions (written out below).
    • Furthermore, evaluation techniques such as resampling and cross-validation with a holdout set rely on the assumption that the training and test data are drawn from the same distribution.
  • However, as the authors highlighted, the i.i.d. assumption fails in the real world: the distribution of the test data often differs from that of the training data, and the model's performance degrades significantly under such distribution shift.
  • They attribute this to the prevalence of single task training on single domain datasets, which limits the model's ability to generalize across diverse conditions and tasks.
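To make the aside above concrete, here is the factorization the i.i.d. assumption buys us (a standard identity, not specific to the paper): for a sample $x_1, \ldots, x_n$ drawn i.i.d. from a distribution $\mathbb{P}$,

$$\mathbb{P}(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} \mathbb{P}(x_i),$$

which is precisely what makes likelihood-based estimation and holdout-style evaluation tractable.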


Key 3. Multi-Task Learning is Nascent (4)

  • The authors then underscored that multi-task learning represents a promising framework. By training a single model on multiple tasks simultaneously, the model can leverage generalizable latent representations to excel across various tasks.
  • It was further pointed out that recent work in the field uses, for example, 10 (dataset, objective) pairs3 to train a single model (an approach known as meta-learning). This implies that:
    • Each dataset and its corresponding objective are unique.
    • For instance, one dataset might focus on sentiment data, with the goal of predicting sentence sentiment, whereas another dataset might concentrate on named entity recognition, aiming to identify named entities within a sentence.
  • The challenge then circles back to the compilation, curation, and annotation of these datasets and objectives to ensure the model's generalizability. Essentially, this dilemma mirrors the initial issue of single-task training on single-domain datasets. The implication is that training a multi-task model might require an equivalent volume of curated data as training several single-task models. Furthermore, scalability becomes a concern when the focus is limited to merely 10 (dataset, objective) pairs.

Key 4. From Word Embeddings to Contextual Embeddings (5,6)

  • Initially, word embeddings such as Word2Vec and GloVe revolutionized the representation of words by mapping them into dense, fixed-dimensional vectors within a continuous $D$-dimensional space, hinging on the fact that words occurring in similar contexts/documents are semantically similar. These vectors were then used as input to a model to perform a specific task.

  • The next advancement was capturing more contextual information by using contextual embeddings, where the word embeddings are conditioned on the entire context of the sentence. Recurrent Neural Networks (RNNs) are one example, and the contextual embeddings can be "transferred" to other downstream tasks.

    Specifically, unidirectional RNNs are adept at assimilating context from preceding elements, whereas bidirectional RNNs excel in integrating context from both preceding and succeeding elements. Nonetheless, both strategies grapple with challenges in encoding long-range dependencies.

    Moreover, RNNs are notoriously plagued by the vanishing gradient problem, which means that the model is biased toward the most recent tokens in the sequence, and its performance degrades as the sequence length increases.

  • Self-attention mechanisms, foundational to the Transformer architecture, mark a paradigm shift by enabling each token to "attend" to every other token within a sequence concurrently.

    • This allows the model to capture long-range dependencies and is the basis for the Transformer architecture. Consequently, self-attention is non-sequential by design and operates over a set of tokens rather than a sequence of tokens. This calls for positional encodings to be added to the input embeddings to capture the sequential nature of the tokens (a minimal self-attention sketch follows this list).

    • This advancement transcends the limitations of static word embeddings. Given the two sentences "I went to the river bank" versus "I went to the bank to withdraw money", the word "bank" in the first sentence is semantically different from the word "bank" in the second; contextual embeddings can capture this difference.

  • The authors then went on to mention that the above methods would still require supervised fine-tuning to adapt to a specific task.

    If minimal or no supervised data is available, other lines of work have shown that language models can still handle such tasks, for example commonsense reasoning (Schwartz et al., 2017) and sentiment analysis (Radford et al., 2017).
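As referenced above, here is a minimal, single-head sketch of scaled dot-product self-attention over a set of token embeddings. It omits causal masking, multi-head splitting, and output projections, so it is a simplified illustration rather than the exact GPT-2 implementation.

```python
# Minimal single-head scaled dot-product self-attention (illustrative only).
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # project tokens to queries/keys/values
    scores = q @ k.T / math.sqrt(k.shape[-1])      # every token attends to every token
    weights = F.softmax(scores, dim=-1)            # attention distribution per token
    return weights @ v                             # context-aware token representations

seq_len, d_model, d_k = 5, 16, 16
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)             # (seq_len, d_k)
```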


Key 5. Zero Shot Learning and Zero Shot Transfer (7)

  • Building upon the foundational concepts introduced previously, the authors explore the utilization of general methods of transfer to illustrate how language models can adeptly execute downstream tasks in a zero-shot manner, without necessitating any modifications to parameters or architecture.

  • Zero-shot learning (ZSL) is characterized by a model's capability to accurately execute tasks or recognize categories that it was not explicitly trained to handle. The crux of ZSL lies in its ability to generalize from known to unknown classes or tasks by harnessing side information or semantic relationships.

    • For example, a model trained to recognize a set of animals that includes horses but not zebras should still be able to recognize a zebra as something close to a horse, given the semantic relationship between the two animals.
  • Zero-shot transfer, often discussed within the context of transfer learning, involves applying a model trained on one set of tasks or domains to a completely new task or domain without any additional training. Here, the focus is on the transferability of learned features or knowledge across different but related tasks or domains. Zero-shot transfer extends the concept of transfer learning by not requiring any examples from the target domain during training, relying instead on the model's ability to generalize across different contexts based on its pre-existing knowledge.


Section 2. Approach

In this section, we discuss the key ideas from the approach section of the GPT-2 paper.

Key 1. Modeling Language Models over Joint Probability Distributions (1)

Language models strive to approximate the complex and inherently unknown distribution of the natural language space, denoted as $\mathcal{D}$. In contrast to supervised learning, which explicitly separates inputs ($\mathcal{X}$) from labels ($\mathcal{Y}$), unsupervised learning, particularly when employing self-supervision as in language modeling, blurs this distinction. Here, $\mathcal{Y}$ is conceptually a shifted counterpart of $\mathcal{X}$, facilitating a unified approach where $\mathcal{D}$ can be modeled exclusively over the space of $\mathcal{X}$. This allows us to frame $\mathcal{D}$ as a probability distribution over sequences of tokens within $\mathcal{X}$, parameterized by $\boldsymbol{\Theta}$.

In this context, the essence of language modeling is to characterize the joint probability distribution of sequences $\mathbf{x} = (x_1, x_2, \ldots, x_T)$ within $\mathcal{X}$. The goal is to maximize the likelihood of observing these sequences in a corpus $\mathcal{S}$, denoted as $\hat{\mathcal{L}}(\mathcal{S} ; \hat{\boldsymbol{\Theta}})$, where $\hat{\boldsymbol{\Theta}}$ represents the estimated parameters that approximate the true parameters $\boldsymbol{\Theta}$.

Key 2. Decompose Joint Distributions as Conditional Distributions via Chain Rule (2)

The joint probability of a sequence in natural language, which is inherently ordered2, can be factorized into the product of conditional probabilities of each token in the sequence using the chain rule of probability. This not only enables tractable sampling from and estimation of the distribution $\mathbb{P}(\mathbf{x} ; \boldsymbol{\Theta})$ but also facilitates modeling conditionals of the form $\mathbb{P}(x_{t-k}, \ldots, x_t \mid x_1, \ldots, x_{t-k-1} ; \boldsymbol{\Theta})$2. Given a corpus $\mathcal{S}$ with $N$ sequences, the likelihood function $\hat{\mathcal{L}}(\mathcal{S} ; \hat{\boldsymbol{\Theta}})$ represents the likelihood of observing these sequences, and the objective is to maximize it, effectively approximating the joint probability distribution through conditional probability distributions.
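Written out explicitly (a standard formulation consistent with the definitions above; the corpus $\mathcal{S}$ is taken to contain $N$ sequences, the $n$-th of length $T_n$, and the likelihood is expressed in log form for convenience):

$$\mathbb{P}(\mathbf{x} ; \boldsymbol{\Theta}) = \prod_{t=1}^{T} \mathbb{P}(x_t \mid x_1, \ldots, x_{t-1} ; \boldsymbol{\Theta}), \qquad \hat{\mathcal{L}}(\mathcal{S} ; \hat{\boldsymbol{\Theta}}) = \sum_{n=1}^{N} \sum_{t=1}^{T_n} \log \mathbb{P}\big(x_t^{(n)} \mid x_{<t}^{(n)} ; \hat{\boldsymbol{\Theta}}\big)$$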

Key 3. Conditional on Task (3)

In the GPT-2 paper, Language Models are Unsupervised Multitask Learners, the authors introduced the concept of conditioning on the task, where the GPT model $\mathcal{G}$ should theoretically learn not only the conditional probability distribution

$$\mathbb{P}(x_t \mid x_{<t} ; \boldsymbol{\Theta})$$

but also the conditional probability distribution

$$\mathbb{P}(x_t \mid x_{<t} ; \boldsymbol{\Theta}, \mathcal{T})$$

where $\mathcal{T}$ is the task that the model should implicitly learn2. This is a powerful concept because, if the hypothesis is correct, the GPT model $\mathcal{G}$ can indeed be a multi-task learner and can be used directly on a wide range of NLU tasks without supervised fine-tuning for downstream domain-specific tasks.

In practice, the authors mentioned that task conditioning is often implemented at an architectural level, via task-specific encoders and decoders as in the paper One Model To Learn Them All4, or at an algorithmic level, such as the inner and outer loop optimization framework seen in Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks5.

However, the authors further noted that, without task-specific architectural changes, one can leverage the sequential nature of the natural language space and express tasks, inputs, and outputs all as a sequence of symbols2. For example, a translation task can be formulated as the sequence (translate to french, english sequence, french sequence), where the model learns to condition on the task (translate to french) in addition to the sequence of tokens. The paper The Natural Language Decathlon: Multitask Learning as Question Answering exemplifies this concept with the Multitask Question Answering Network (MQAN), a single model trained to perform many diverse natural language processing tasks simultaneously.
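As a minimal sketch of this idea, the snippet below formats a translation task purely as text; the prompt template and helper name are illustrative assumptions, not the paper's actual templates.

```python
# Express (task, input, output) as one sequence of symbols: the "task" is just
# more text that the language model conditions on. Template is illustrative only.
def format_task(task, source, target=None):
    prompt = f"{task}: {source} ="
    # During training the target is simply part of the sequence; at inference
    # time the model continues the prompt with its own prediction.
    return f"{prompt} {target}" if target is not None else prompt

train_example = format_task(
    "translate english to french",
    "Did you go to the cinema?",
    "Es-tu allé au cinéma ?",
)
zero_shot_prompt = format_task("translate english to french", "Did you go to the cinema?")
print(zero_shot_prompt)  # translate english to french: Did you go to the cinema? =
```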

Key 4. Optimizing Unsupervised is the same as Optimizing Supervised (4)

The GPT-2 paper Language Models are Unsupervised Multitask Learners argues for doing away with the supervised fine-tuning phase via an interesting hypothesis: because the supervised objective is the same as the unsupervised objective, only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective, so optimizing the unsupervised objective is effectively optimizing the supervised one2.

Key 5. Large Language Models Have the Capacity to Infer and Generalize (5)

In what follows, the authors added that the internet contains a vast amount of information that is passively available without the need for interactive communication. The French-to-English translation example provided earlier is bound to occur naturally on the internet. They speculate that if the language model is large enough in capacity, then it should be able to learn to perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement2.

In the figure below, we can see examples of naturally occurring demonstrations of English to French and French to English translation found throughout the WebText training set.

Figure: Examples of naturally occurring demonstrations of English to French and French to English translation found throughout the WebText training set.

2.1. Training Dataset

Key 1. Rejection of CommonCrawl (1,2)

  • Prior research often focused on training language models on single-domain datasets, which relates to the concept of models becoming narrow experts.
  • To cultivate competent generalists, the authors contend that models need exposure to a diverse array of tasks and domains.
  • CommonCrawl, which houses an expansive collection of web scrapes covering a very large portion of the public web, is recognized for its diversity.
  • Nevertheless, CommonCrawl was ultimately rejected by the authors due to significant data quality issues.

Key 2. Construction of WebText Dataset

  • The authors sought to compile a web scrape prioritizing document quality over quantity.
  • To attain a certain level of document quality without the exorbitant costs of manual curation, the authors employed a strategy of indirect human curation. This involved scraping all outbound links from Reddit that garnered a minimum of 3 karma. Karma, in this scenario, acts as a heuristic for content deemed interesting, educational, or entertaining by the Reddit community.
    • Outbound links refer to instances where a Reddit post links out to external websites; the authors included the content from these external sites in their dataset, contingent on the originating post receiving at least 3 karma.
  • The resulting dataset, dubbed WebText, comprises text from approximately 45 million links.
  • Subsequent preprocessing efforts, including de-duplication, heuristic-based cleaning, and the exclusion of Wikipedia links, resulted in a dataset spanning about 40GB of text (8 million documents); the overall curation heuristic is sketched after this list.
  • The dataset snapshot includes only links created before December 2017.
  • Wikipedia's exclusion was deliberate, stemming from the authors' intention to minimize overlap with training sources prevalent in other studies. This decision aimed to facilitate more "authentic" evaluation/testing scenarios for their model by reducing data leakage.
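As a rough sketch of the curation heuristic described in this list: the `posts` structure and its field names are hypothetical stand-ins, not the authors' actual pipeline.

```python
# Illustrative sketch of the WebText heuristic: keep outbound links from Reddit
# posts with at least 3 karma, then de-duplicate and drop Wikipedia links.
# `posts` and its fields are hypothetical; this is not the authors' pipeline.
def collect_webtext_links(posts, min_karma=3):
    links = set()
    for post in posts:
        if post["karma"] >= min_karma:
            for url in post["outbound_links"]:
                if "wikipedia.org" not in url:   # exclude Wikipedia to reduce leakage
                    links.add(url)               # set membership de-duplicates
    return links
```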

2.2. Input Representation

Key 1. Byte Pair Encoding (BPE) (1,2,3)

  • Traditional tokenization methods often involve steps such as lower-casing, punctuation stripping, and splitting on whitespace. Additionally, these methods might encode out-of-vocabulary words using a special token to enable the model to handle unseen words during evaluation or testing phases. For instance, language models (LMs) may struggle with interpreting emojis due to such constraints.
  • These conventional approaches can inadvertently restrict the natural language input space $\mathcal{X}$, consequently limiting the model space $\mathcal{H}$. This limitation stems from the fact that the scope of $\mathcal{H}$ inherently depends on the comprehensiveness of $\mathcal{X}$: we can write $\mathcal{H} = \mathcal{H}(\mathcal{X} ; \boldsymbol{\Theta})$, meaning the model space $\mathcal{H}$ is a function of the input space $\mathcal{X}$ and the parameter space $\boldsymbol{\Theta}$.
  • To resolve this, byte-level encoding can be used, since UTF-8 can theoretically encode any character.
  • However, current byte-level language models tend to perform poorly on word-level tasks.
  • The authors therefore adopted a byte-level BPE algorithm (it is "byte-level" because it operates on the bytes of UTF-8 encoded strings), striking a balance between character-level and word-level tokenization.
  • In summary, BPE is the tokenizer used to encode the input text into a sequence of tokens, which forms the input representation to the model (a toy sketch of the BPE merge step follows this list).
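To give a feel for the merge step at the heart of BPE, here is a toy, character-level sketch of counting and merging the most frequent adjacent pair. GPT-2's actual tokenizer applies the same idea to raw bytes with additional rules (e.g., restricting merges across character categories), so this is deliberately simplified.

```python
# Toy illustration of one BPE merge step on a tiny corpus (character-level for
# readability; GPT-2's tokenizer applies the same idea to byte sequences).
from collections import Counter

corpus = [list("lower"), list("lowest"), list("low")]

def most_frequent_pair(sequences):
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))          # count adjacent symbol pairs
    return pairs.most_common(1)[0][0]

def merge_pair(sequences, pair):
    merged = []
    for seq in sequences:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1])  # fuse the pair into one symbol
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged

pair = most_frequent_pair(corpus)                # e.g. ('l', 'o')
corpus = merge_pair(corpus, pair)                # 'l' and 'o' fused into 'lo'
```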


2.3. Model

The GPT-2 architecture is a transformer-based model, and as the name suggests, it is a continuation of the GPT-1 model with some minor modifications.

Key 1. GPT-2 is a Continuation of GPT-1 with Self-Attention Mechanisms (1)

  • GPT-2 utilizes a Transformer architecture6 as its backbone, which is distinguished by self-attention mechanisms. This architecture empowers the model to capture complex dependencies and relationships within the data.

Key 2. Modifications from GPT-1 and Model Stability (1)

  • Modifications from GPT-1 include:

    • Layer normalization is repositioned to the input of each sub-block, mirroring a pre-activation residual network. This modification is believed to improve training stability and model performance: by normalizing the inputs to each sub-block, it is conjectured to alleviate issues tied to internal covariate shift, aiding smoother and potentially faster training.

    • GPT-2 introduces an additional layer normalization step that is executed after the final self-attention block within the model. This additional normalization step can help ensure that the outputs of the transformer layers are normalized before being passed to subsequent layers or used in further processing, further contributing to model stability.

    • The GPT-2 paper introduces a modification to the standard weight initialization of the model's residual layers. Specifically, the weights are scaled by a factor of $\frac{1}{\sqrt{N}}$, where $N$ represents the number of residual layers (decoder blocks) in the Transformer.

      The rationale, as quoted from the paper ("A modified initialization which accounts for the accumulation on the residual path with model depth is used"2), is to ensure that the variance of the input to the block is the same as the variance of the block's output, so the signal is neither amplified nor diminished as it passes through. As the model depth increases, activations accumulate along the residual path, hence the scaling factor of $\frac{1}{\sqrt{N}}$ to scale them down (this modification and the pre-norm placement are sketched after this list).

    • Clearly, the emphasis is on model stability. In training large language models, numerical stability is paramount; the cost of training is significantly high, and every loss or gradient spike that fails to recover necessitates a return to a previous checkpoint, wasting substantial GPU hours and potentially tens of thousands of dollars.

    • The model's vocabulary is expanded to 50,257 tokens.

    • The context window size is increased from 512 to 1024 tokens, enhancing the model's ability to maintain coherence over longer text spans.

    • With a larger batch size of 512, GPT-2 benefits from more stable and effective gradient estimates during training, contributing to improved learning outcomes.
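The sketch below illustrates the two stability-related modifications (pre-LayerNorm placement and $\frac{1}{\sqrt{N}}$ residual scaling at initialization) in PyTorch; the module structure and hyperparameters are illustrative assumptions, not the released GPT-2 code.

```python
# Sketch of a pre-LayerNorm decoder block with GPT-2-style residual scaling at
# initialization. Illustrative only; not a faithful reproduction of GPT-2.
import math
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, n_blocks=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        # Scale the projections feeding the residual stream by 1/sqrt(N) so the
        # variance does not grow as activations accumulate over N blocks.
        with torch.no_grad():
            scale = 1.0 / math.sqrt(n_blocks)
            self.attn.out_proj.weight.mul_(scale)
            self.mlp[-1].weight.mul_(scale)

    def forward(self, x, causal_mask=None):
        # LayerNorm is applied to the *input* of each sub-block (pre-activation style)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x
```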

GPT-2 Variants

To this end, we encapsulate some key parameters in the table below, which provides specifications for several GPT-2 variants, distinguished by their scale.

| Parameters | Layers | d_model | Heads | d_ff | Activation | Vocabulary Size | Context Window |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 117M | 12 | 768 | 12 | 3072 | GELU | 50,257 | 1024 |
| 345M | 24 | 1024 | 16 | 4096 | GELU | 50,257 | 1024 |
| 762M | 36 | 1280 | 20 | 5120 | GELU | 50,257 | 1024 |
| 1542M | 48 | 1600 | 25 | 6400 | GELU | 50,257 | 1024 |

See The Implementation of Generative Pre-trained Transformers (GPT) for a more comprehensive walkthrough of the GPT-2 model architecture, annotated with code.

Footnotes

  1. A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. "Improving Language Understanding by Generative Pre-Training". 2018.

  2. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. "Language Models are Unsupervised Multitask Learners". 2019.

  3. Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. "The Natural Language Decathlon: Multitask Learning as Question Answering". 2018. arXiv:1806.08730.

  4. Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. "One Model To Learn Them All". 2017. arXiv:1706.05137.

  5. Chelsea Finn, Pieter Abbeel, and Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks". 2017. arXiv:1703.03400.

  6. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. "Attention Is All You Need". In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.