TIL: Left vs Right Padding

TL;DR: Use left padding for autoregressive (decoder) LLMs such as GPT-2, and right padding for encoder LLMs such as BERT.

Setup for the examples below:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2", add_bos_token=True)
tokenizer.add_special_tokens({'pad_token': '[PAD]', 'bos_token': '[CLS]'})
Why do we need padding?
Language models expect a tensor of shape (batch_size, sequence_length), so every sequence in the batch must have the same length.
However, not all input texts are the same length (obviously :)).
Let us assume that each word in a sequence is its own token, and that [CLS] represents the start-of-text token.
We cannot pass both of these texts through at the same time:
[CLS] hello world (length 3)
[CLS] the cat sat on the mat (length 7)
Therefore we must pad the shorter sequences up to the length of the longest sequence, which brings us to the question: do we pad the left side of the sequence or the right?
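To make this concrete, here is a minimal sketch (using PyTorch and made-up token ids) of why ragged sequences cannot be batched directly:

import torch

# Made-up token ids for the two example texts (lengths 3 and 7)
short = [0, 11, 12]                 # [CLS] hello world
long = [0, 21, 22, 23, 24, 21, 25]  # [CLS] the cat sat on the mat

# Ragged lists cannot be stacked into a (batch_size, sequence_length) tensor:
try:
    torch.tensor([short, long])
except ValueError as err:
    print(err)  # e.g. "expected sequence of length 3 at dim 1 (got 7)"

# Padding the shorter sequence (here with a made-up pad id of 99) fixes the shape
pad_id = 99
padded_short = short + [pad_id] * (len(long) - len(short))
batch = torch.tensor([padded_short, long])
print(batch.shape)  # torch.Size([2, 7])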
Right Padding
This seems the most intuitive approach, padding the right-hand side of the input:
prompts = ["hello world", "the cat sat on the mat"]
tokenizer.padding_side = "right"
batch_tokens = tokenizer(prompts, padding="longest").input_ids
for tokens in batch_tokens:
    print(" | ".join(f"{tokenizer.decode([token]):6}" for token in tokens))

[CLS]  | hello  | world  | [PAD]  | [PAD]  | [PAD]  | [PAD]
[CLS]  | the    | cat    | sat    | on     | the    | mat
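Note that the tokenizer also returns an attention_mask alongside the input ids, marking real tokens with 1 and padding with 0 so the model can ignore the pads. A minimal check (the output below is sketched from the right-padded example above):

batch = tokenizer(prompts, padding="longest")
print(batch.attention_mask)
# With right padding this should look roughly like:
# [[1, 1, 1, 0, 0, 0, 0],
#  [1, 1, 1, 1, 1, 1, 1]]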
Left Padding
tokenizer.padding_side = "left"
batch_tokens = tokenizer(prompts, padding="longest").input_ids
for tokens in batch_tokens:
    print(" | ".join(f"{tokenizer.decode([token]):6}" for token in tokens))

[PAD]  | [PAD]  | [PAD]  | [PAD]  | [CLS]  | hello  | world
[CLS]  | the    | cat    | sat    | on     | the    | mat
Encoder vs Decoder Models
Encoder
For encoder-style models (e.g. BERT) we extract the final latent representation of the first [CLS] token at the start of the sequence. Because attention is bidirectional (every token can attend to every other token), this representation can incorporate information from the entire sequence.
For encoder models, right padding therefore makes sense: the [CLS] token is always in the first position, which makes it easy to extract.
residual: Float[Tensor, "batch seq d_model"]  # final residual stream / hidden states
cls_resid = residual[:, 0, :]  # extract the [CLS] residual (index 0)

extract
  |
  v
[CLS] hello world [PAD] [PAD]
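As a minimal runnable sketch of this (the bert-base-uncased checkpoint is just an illustrative choice; its tokenizer pads on the right by default):

import torch
from transformers import BertTokenizer, BertModel

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

batch = bert_tokenizer(
    ["hello world", "the cat sat on the mat"],
    padding="longest",
    return_tensors="pt",
)

with torch.no_grad():
    out = bert(**batch)

# With right padding, [CLS] is always at index 0, so extraction is a fixed slice
cls_resid = out.last_hidden_state[:, 0, :]  # shape (batch, d_model)
print(cls_resid.shape)  # torch.Size([2, 768])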
Decoder
For decoder-style models (e.g. GPT-2) we want to autoregressively generate the next token based on the representation of the final token in the sequence.
residual: Float[Tensor, "batch seq d_model"]  # final residual stream / hidden states
final_tok_resid = residual[:, -1, :]  # extract residual of the final token (to predict the next one)
logits = final_tok_resid @ W_U  # unembed (assuming W_U has shape d_model x d_vocab)

Padding before the BOS token ([CLS]) allows the model to ignore the variable-length padding when autoregressively generating an output.

[PAD] [PAD] [CLS] hello world -> (next token prediction)
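As a minimal end-to-end sketch reusing the tokenizer and prompts from above (the newly added [PAD]/[CLS] embeddings are untrained, so the actual predictions are illustrative only):

import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # account for the added [PAD] and [CLS] tokens

tokenizer.padding_side = "left"
batch = tokenizer(prompts, padding="longest", return_tensors="pt")

with torch.no_grad():
    logits = model(**batch).logits  # shape (batch, seq, vocab)

# With left padding, the last real token sits at index -1 for every row,
# so the next-token distribution is just the logits at the final position.
# (With right padding, index -1 would point at a [PAD] token for the shorter prompt.)
next_token_logits = logits[:, -1, :]
next_tokens = next_token_logits.argmax(dim=-1)
print([tokenizer.decode([tok]) for tok in next_tokens.tolist()])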