March 14 was a huge day for the AI community. OpenAI released GPT-4, a multimodal large language model (MLLM) that demonstrates commonsense reasoning over both text and images and can operate with a context length of up to 32,000 tokens. Incredibly, GPT-4 was released less than one hour after Anthropic announced their own model, Claude. Claude is a text-only model with a context window of ~9,000 tokens.
Since GPT-4 can perceive images as well as text, it demonstrates impressive behavior such as visual question answering and image captioning. Having a longer context length (up from GPT-3’s 4,096[1]) is of major practical significance; a single prompt can cover hundreds of pages. This could be enough to contain a legal contract, a short story, or a company’s internal documents.
GPT-4 isn’t the first MLLM with vision capabilities. In early March 2023, Microsoft released KOSMOS-1[2], which is trained on interleaved text and images. DeepMind trained a similar model, Flamingo[3], in April 2022. These models can engage in dialogue about images, caption images, and answer visual questions in a zero-shot manner, meaning they can solve tasks they were not explicitly trained to solve.
This release follows several models from OpenAI that have been of interest to the ML community recently, including DALL-E 2[4], Whisper[5], and ChatGPT. For those interested, we previously posted a deep dive into Whisper and how it works.
OpenAI has not revealed information about the size, architecture, or training specifics of GPT-4*. Therefore, the purpose of this post is to present what is known and explain how models with similar capabilities (KOSMOS-1 and Flamingo) work.
We also draw conclusions from the GPT-4 Technical Report and the developer live stream. The rest of the blog is broken down into the sections that follow.
Language Modelling with Transformers: A Recap
A generative language model (LM) is trained to predict the next token** in a sequence, based on some tokens that have already been provided. Through this task, the model learns to use past context to build representations of language. The training process incentivizes the model to maximize the predicted probability of the “true” token, and to minimize the probabilities of the other tokens, using cross-entropy loss. Intuitively, a good LM would generate a high probability for the token “mat” when given the context “The cat sat on the”.
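To make this concrete, here is a minimal PyTorch sketch of the next-token objective (ours, not OpenAI’s): the model’s logits over a toy vocabulary are scored with cross-entropy against the true next token.

```python
import torch
import torch.nn.functional as F

# Toy vocabulary and the context "The cat sat on the"; the true next token is "mat".
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
target = torch.tensor([vocab.index("mat")])

# Pretend these are the logits a language model produced for the next token.
# In a real LM they come from the final linear layer over the whole vocabulary.
logits = torch.tensor([[0.5, 0.1, 0.2, 0.3, 2.5, 0.1]])

# Cross-entropy is low when the model puts high probability on the true token ("mat").
loss = F.cross_entropy(logits, target)
print(loss.item())  # ~0.42 here; it shrinks as more probability mass moves to "mat"
```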
GPT-style models are decoder-only transformers[6] which take in a sequence of tokens (in the form of token embeddings) and generate output tokens one at a time. Concretely, the token embeddings are converted into a sequence of features that represent the input sequence. Each layer of the model refines this representation using the features learnt in the previous layer, and the features of the final layer are used to predict the output tokens.
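At inference time this becomes an autoregressive loop. The sketch below uses a stand-in “model” (random logits, just so it runs) to illustrate the idea: predict a token, append it to the sequence, and repeat.

```python
import torch

def generate(model, tokens, n_new):
    # Autoregressive decoding: run the model over the sequence, take the logits at the
    # last position, pick the next token, append it, and repeat.
    for _ in range(n_new):
        logits = model(tokens)               # (seq_len, vocab_size)
        next_token = logits[-1].argmax()     # greedy choice; real systems often sample instead
        tokens = torch.cat([tokens, next_token.view(1)])
    return tokens

# Stand-in "model" that returns random logits over a 10-token vocabulary, just so this runs.
dummy_model = lambda toks: torch.randn(len(toks), 10)
print(generate(dummy_model, torch.tensor([1, 2, 3]), n_new=5))
```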
The key innovation of the transformer architecture is the self-attention mechanism. Self-attention allows the model to process all tokens in the input sequence in parallel, rather than sequentially, and to ‘attend to’ (or share information between) different positions in the sequence.
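The sketch below shows scaled dot-product self-attention with a causal mask, stripped of the multi-head and projection details used in practice; the weight matrices are random and purely illustrative.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model), one embedding per token, all processed in parallel
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / k.shape[-1] ** 0.5                        # similarity of every position to every other
    causal = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(causal, float("-inf"))           # a decoder-only LM cannot see future tokens
    weights = F.softmax(scores, dim=-1)                          # attention weights sum to 1 per position
    return weights @ v                                           # each output mixes information across positions

d = 16
x = torch.randn(5, d)                                            # 5 token embeddings
out = self_attention(x, torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
print(out.shape)  # torch.Size([5, 16])
```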
If you’d like to understand how this works in more detail, we recommend this amazing illustrated example.
Multi-Modal Language Modelling
GPT-4 is trained on both text and images. Its dataset is likely similar to that of KOSMOS-1[2], which is summarized in Table 1. GPT-3 was trained on text corpora totaling roughly 300 billion tokens. DeepMind’s Chinchilla LM showed that scaling the amount of training data in step with parameter count (Chinchilla was trained on 1.4 trillion tokens) is necessary for compute-optimal performance. We speculate that OpenAI scaled up the dataset for GPT-4 to a similar size as used by Chinchilla, or larger.
Table 1: The modalities, sources, and input structure of KOSMOS-1's training data.
For training, each modality must be converted to a representation in the same embedding space. In other words, we need a sequence of vectors of the same dimension, whether they are generated from text or from images.
For text, this is straightforward since the tokens are already discretized. In KOSMOS-1, each token is assigned an embedding that is learned during training, with the consequence that tokens with similar semantic meaning end up closer together in the embedding space.
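In a typical implementation this is just a learned lookup table; the sizes below are illustrative, not KOSMOS-1’s actual configuration.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not KOSMOS-1's actual configuration
vocab_size, d_model = 32_000, 768
embed = nn.Embedding(vocab_size, d_model)    # one learned vector per token in the vocabulary

token_ids = torch.tensor([[17, 923, 4051]])  # a tokenized sentence (batch of 1, three tokens)
token_embeddings = embed(token_ids)          # shape (1, 3, 768)
print(token_embeddings.shape)
```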
KOSMOS-1 deals with images using the MetaLM[7] approach. This provides a general-purpose interface supporting natural language interactions with other non-causal models. A pre-trained image encoder generates embeddings that are passed through a connector layer, which projects to the same dimension as the text embeddings. KOSMOS-1 can then handle image embeddings while predicting text tokens, as shown in Figure 1.
To do this, the model must learn the relationship between text and images. Each image is represented by multiple embeddings (positions 7-12 in Figure 1) which are passed through the transformer. During training, only the token predicted after seeing all the image embeddings (e.g. x9 in Figure 1) is used to calculate the loss. When predicting this token, the transformer can still attend to all the image embeddings, thus allowing the model to learn a relationship between text and images.
Figure 1: Diagram of MetaLM: a general-purpose interface used by KOSMOS-1 to enable visual language modelling. KOSMOS-1 only outputs text tokens, so predictions made at the positions of the intermediate pre-trained model embeddings (e.g. positions 7-11) are ignored during training. At position 12, the multimodal input finishes, and the token prediction x9 is used. (image source)
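A minimal sketch of this idea, with made-up names and shapes rather than KOSMOS-1’s actual code: image features are projected into the text embedding space by a connector, spliced into the token sequence, and masked out of the loss.

```python
import torch
import torch.nn as nn

# Illustrative sizes; the names ("connector", etc.) are ours
d_image, d_model, vocab_size = 1024, 768, 32_000
connector = nn.Linear(d_image, d_model)        # projects image features into the text embedding space
text_embed = nn.Embedding(vocab_size, d_model)

# A prompt of the form: <text tokens> <image> <text tokens>
text_a = text_embed(torch.tensor([5, 17, 256]))   # (3, d_model)
image_feats = torch.randn(6, d_image)             # 6 embeddings from the image encoder
image_part = connector(image_feats)               # (6, d_model)
text_b = text_embed(torch.tensor([901, 12]))      # (2, d_model)

sequence = torch.cat([text_a, image_part, text_b])  # (11, d_model), fed to the transformer as one sequence

# Only positions that predict a text token contribute to the loss; predictions made
# at the intermediate image positions are masked out.
loss_mask = torch.tensor([1, 1, 1] + [0] * 6 + [1, 1]).bool()
print(sequence.shape, loss_mask)
```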
The image encoder is a pre-trained Vision Transformer (ViT)[8], a common architecture for image processing tasks. The ViT splits an image into a grid of fixed-size patches, as shown in Figure 2. Each patch is flattened and linearly projected into a token embedding, and the resulting sequence of tokens is processed by the transformer to produce an output embedding. The ViT is encoder-only.
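In code, the patch projection is typically implemented as a single convolution whose kernel size and stride equal the patch size; the embedding width below is illustrative.

```python
import torch
import torch.nn as nn

# ViT-L/14 uses 14x14 patches; the embedding width here is illustrative
patch_size, d_model = 14, 1024
# A convolution with kernel size and stride equal to the patch size is exactly
# a per-patch linear projection.
patchify = nn.Conv2d(in_channels=3, out_channels=d_model,
                     kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)             # one 224x224 RGB image
patches = patchify(image)                       # (1, d_model, 16, 16): a 16x16 grid of patches
tokens = patches.flatten(2).transpose(1, 2)     # (1, 256, d_model): a sequence of 256 patch tokens
print(tokens.shape)
```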
The ViT is trained with the Contrastive Language-Image Pre-training (CLIP) objective[9]. Roughly speaking, images and text share an embedding space, and the model is trained such that matching image-text pairs have a high cosine similarity while non-matching pairs have a low one.
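Roughly, the contrastive objective looks like the sketch below, where random vectors stand in for the two encoders’ outputs and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

# A batch of N matching image-text pairs; random vectors stand in for the two encoders' outputs.
N, d = 8, 512
image_emb = F.normalize(torch.randn(N, d), dim=-1)   # L2-normalised, so dot products are cosine similarities
text_emb = F.normalize(torch.randn(N, d), dim=-1)

temperature = 0.07                                    # illustrative value
logits = image_emb @ text_emb.T / temperature         # (N, N) pairwise similarities

# The i-th image should match the i-th caption, so the "correct class" for row i is i.
targets = torch.arange(N)
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(loss.item())
```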
During KOSMOS-1 training, the ViT parameters are frozen, except for the last layer. The exact model is CLIP ViT-L/14. GPT-4 may use this approach as well. Alternatively, it’s not unreasonable to think that, with enough data, the image encoder could be trained from scratch.
Figure 2: Overview of the Vision Transformer (ViT) architecture used as the image encoder of KOSMOS-1. The image is split into patches and processed by the transformer. (image source)
Flamingo[3] uses a different approach to multimodal language modelling. It could be a more likely blueprint for GPT-4, since Flamingo was released in April 2022 and OpenAI’s GPT-4 pre-training was completed in August 2022.
Flamingo also relies on a pre-trained image encoder, but instead uses the generated embeddings in cross-attention layers that are interleaved in a pre-trained LM (Figure 3).
Figure 3: Flamingo’s text and image processing. Each image embedding is generated by a pre-trained image encoder. These are passed through a resampler that produces a fixed-length representation. The embeddings are used in cross-attention layers that are inserted into a pre-trained LM. (image source)
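A heavily simplified sketch of this idea (module names and sizes are ours; the real Flamingo blocks are gated and more elaborate): a cross-attention layer lets the text features query the image features before each frozen LM layer.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8   # illustrative sizes

class CrossAttentionBlock(nn.Module):
    # Inserted between frozen LM layers: the text features query the image features.
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_feats, image_feats):
        attended, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        return text_feats + attended           # residual connection preserves the frozen LM's features

cross_attn = CrossAttentionBlock()
lm_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)  # stand-in for a frozen LM layer

text_feats = torch.randn(1, 10, d_model)    # features for 10 text tokens
image_feats = torch.randn(1, 64, d_model)   # fixed-length image representation from the resampler

out = lm_layer(cross_attn(text_feats, image_feats))   # cross-attend to the image, then run the LM layer
print(out.shape)  # torch.Size([1, 10, 512])
```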
To learn more about vision language models, we recommend this HuggingFace blog.