TIL: Left vs Right Padding

TL;DR: You should use left padding for autoregressive, decoder-only models (e.g. GPT-2), but right padding for encoder models (e.g. BERT)

Why do we need padding?

Language models expect a tensor of shape (batch_size, sequence_length), therefore each sequence in the batch must be of the same length.

However, not all input texts are the same length (obviously :))

Let us assume that each word in a sequence is its own token, and that [CLS] represents the start-of-text token.

We cannot pass both of these texts to the model in the same batch:

[CLS]  hello  world                               (length 3)
[CLS]  the     cat     sat    on     the    mat   (length 7)
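If we tried to stack these two token sequences directly into one tensor it would fail. A minimal sketch, assuming PyTorch (the token ids are made up for illustration):

import torch

short = [0, 1, 2]                 # hypothetical ids for "[CLS] hello world"
long = [0, 3, 4, 5, 6, 3, 7]      # hypothetical ids for "[CLS] the cat sat on the mat"

try:
    torch.tensor([short, long])   # rows of different length cannot form one tensor
except ValueError as err:
    print(err)                    # expected sequence of length 3 at dim 1 (got 7)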

Therefore we must pad the shorter sequences up to the length of the longest sequence, which brings us to the question: do we pad the left side of the sequence or the right?

from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2", add_bos_token=True)
# GPT-2 has no pad token by default, so add [PAD] and use [CLS] as the BOS token
tokenizer.add_special_tokens({'pad_token': '[PAD]', 'bos_token': '[CLS]'});

Right Padding

This seems the most intuitive option: we pad the right side of the input:

prompts = ["hello world", "the cat sat on the mat"]

tokenizer.padding_side = "right"
batch_tokens = tokenizer(prompts, padding="longest").input_ids
for tokens in batch_tokens:
    print(" | ".join(f"{tokenizer.decode([token]):6}" for token in tokens))
[CLS]  | hello  |  world | [PAD]  | [PAD]  | [PAD]  | [PAD] 
[CLS]  | the    |  cat   |  sat   |  on    |  the   |  mat  
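The same tokenizer call also returns an attention_mask marking which positions are real tokens (1) and which are padding (0), so the model can ignore the [PAD] positions. A quick sketch (the mask values shown are what we would expect for this batch):

batch = tokenizer(prompts, padding="longest")
for mask in batch.attention_mask:
    print(mask)
[1, 1, 1, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1]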

Left Padding

tokenizer.padding_side = "left"
batch_tokens = tokenizer(prompts, padding="longest").input_ids
for tokens in batch_tokens:
    print(" | ".join(f"{tokenizer.decode([token]):6}" for token in tokens))
[PAD]  | [PAD]  | [PAD]  | [PAD]  | [CLS]  | hello  |  world
[CLS]  | the    |  cat   |  sat   |  on    |  the   |  mat  

Encoder vs Decoder Models

Encoder

For encoder-style models (e.g. BERT) we extract the final latent representation of the [CLS] token at the start of the sequence. Because the attention is fully bidirectional, this representation contains information from the entire sequence.

For encoder models, right padding makes sense: the [CLS] token is always in the first position (index 0), which makes it easy to extract.

residual_output: Float[Tensor, "batch seq d_model"]
cls_resid = residual_output[:, 0, :]  # extract the [CLS] residual (index 0)
extract
  |
  v
[CLS]  hello  world  [PAD]  [PAD]
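As a concrete sketch with Hugging Face's BertModel (assuming bert-base-uncased, where last_hidden_state plays the role of residual_output):

from transformers import BertModel, BertTokenizer
import torch

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # pads on the right by default
bert = BertModel.from_pretrained("bert-base-uncased")

inputs = bert_tokenizer(["hello world", "the cat sat on the mat"], padding="longest", return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)

cls_resid = outputs.last_hidden_state[:, 0, :]  # [CLS] is always at index 0 with right padding
print(cls_resid.shape)  # torch.Size([2, 768])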

Decoder

For decoder-style models (e.g. GPT-2) we want to autoregressively generate the next token based on the final token in the sequence.

residual_output: Float[Tensor, "batch seq d_model"]
final_tok_resid = residual_output[:, -1, :]  # residual of the final token (used to predict the next one)
logits = W_U @ final_tok_resid               # unembed the final-token residual into next-token logits

Padding on the left, before the BOS token ([CLS]), keeps the real tokens flush with the end of the sequence, so index -1 always holds the last real token rather than a [PAD]. This lets the model ignore the variable-length padding when autoregressively generating an output; with right padding, index -1 would land on a [PAD] token for the shorter sequences.

[PAD]  [PAD]  [CLS]  hello  world  ->  (next token prediction)
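Putting it together, here is a sketch of generation with a left-padded batch, reusing the tokenizer and prompts from above and assuming GPT2LMHeadModel from transformers (the freshly added [PAD]/[CLS] embeddings are untrained, so the continuations themselves are only illustrative):

from transformers import GPT2LMHeadModel
import torch

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # account for the [PAD]/[CLS] tokens added above

tokenizer.padding_side = "left"
batch = tokenizer(prompts, padding="longest", return_tensors="pt")

# generate() continues from the final position of each sequence; with left
# padding that position is always the last real token, never a [PAD].
out = model.generate(**batch, max_new_tokens=5, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.batch_decode(out))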