TIL: Left vs Right Padding

TL;DR: You should use left padding for autoregressive (decoder) models like GPT-2, but right padding for encoder models like BERT.

Setup:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2", add_bos_token=True)
tokenizer.add_special_tokens({'pad_token': '[PAD]', 'bos_token': '[CLS]'})
```
Why do we need padding?
Language models expect a tensor of shape (batch_size, sequence_length), so each sequence in a batch must be the same length. However, not all input texts are the same length (obviously :)).

Let us assume that each word in a sequence is its own token, and that [CLS] represents the start-of-text token. We could not pass both of these texts through the model at the same time:
[CLS] hello world (length 3)
[CLS] the cat sat on the mat (length 7)
Therefore we must pad the shorter sequences up to the length of the longest sequence, which brings us to the question: do we pad the left side of the sequence or the right?
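To make this concrete, here is a minimal sketch (with made-up token IDs, not real tokenizer output) showing that variable-length sequences cannot be stacked into one rectangular tensor until we pad them:

```python
import torch

# hypothetical token IDs for the two example texts (lengths 3 and 7)
short_seq = torch.tensor([101, 7592, 2088])
long_seq = torch.tensor([101, 1996, 4937, 2938, 2006, 1996, 13523])

# torch.stack requires every tensor to have the same shape, so this fails
try:
    torch.stack([short_seq, long_seq])
except RuntimeError as err:
    print(f"Cannot batch unpadded sequences: {err}")

# padding the short sequence up to length 7 makes the shapes match
pad_id = 0
padded_short = torch.cat([short_seq, torch.full((4,), pad_id)])
batch = torch.stack([padded_short, long_seq])
print(batch.shape)  # torch.Size([2, 7])
```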
Right Padding
This is the most intuitive option; we pad the right side of the input:
= ["hello world", "the cat sat on the mat"]
prompts
= "right"
tokenizer.padding_side = tokenizer(prompts, padding="longest").input_ids
batch_tokens for tokens in batch_tokens:
print(" | ".join(f"{tokenizer.decode([token]):6}" for token in tokens))
[CLS] | hello | world | [PAD] | [PAD] | [PAD] | [PAD]
[CLS] | the | cat | sat | on | the | mat
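The tokenizer also returns an attention_mask marking which positions are real tokens (1) and which are padding (0); here is a quick sketch of inspecting it for this right-padded batch (the printed masks are what I would expect, not output from the original post):

```python
# 1 = real token, 0 = padding the model should ignore
batch = tokenizer(prompts, padding="longest")
for mask in batch.attention_mask:
    print(mask)

# [1, 1, 1, 0, 0, 0, 0]   <- "[CLS] hello world" plus four masked [PAD]s
# [1, 1, 1, 1, 1, 1, 1]
```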
Left Padding
= "left"
tokenizer.padding_side = tokenizer(prompts, padding="longest").input_ids
batch_tokens for tokens in batch_tokens:
print(" | ".join(f"{tokenizer.decode([token]):6}" for token in tokens))
[PAD] | [PAD] | [PAD] | [PAD] | [CLS] | hello | world
[CLS] | the | cat | sat | on | the | mat
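This is the property we want for generation: with left padding, position -1 of every row is a real token rather than [PAD]. A quick sketch (the decoded strings in the comment are what I would expect from GPT-2's tokenizer):

```python
# with left padding, the last position of every row is the final real token
batch = tokenizer(prompts, padding="longest", return_tensors="pt")
last_tokens = batch.input_ids[:, -1].tolist()
print([tokenizer.decode([tok]) for tok in last_tokens])
# expected: [' world', ' mat'] -- never a [PAD] token
```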
Encoder vs Decoder Models
Encoder
For encoder-style models (e.g. BERT) we extract the final latent representation of the [CLS] token at the start of the sequence. Due to full (bidirectional) attention, this representation contains information from the entire sequence.

For encoder models, right padding makes sense: the [CLS] token is always in the first position, which makes it easy to extract.
"batch seq d_model"]
residual_output: Float[Tensor, = residual[:, 0, :] # extract [CLS] residual (index 0) cls_resid
extract
|
v
[CLS] hello world [PAD] [PAD]
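For completeness, here is a minimal sketch of the same idea with a real encoder model. It assumes bert-base-uncased (my choice, not part of the original snippet), whose tokenizer right-pads and prepends [CLS] by default:

```python
import torch
from transformers import AutoModel, AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

inputs = bert_tokenizer(
    ["hello world", "the cat sat on the mat"],
    padding="longest",
    return_tensors="pt",
)

with torch.no_grad():
    residual_output = bert(**inputs).last_hidden_state  # (batch, seq, d_model)

cls_resid = residual_output[:, 0, :]  # [CLS] is always at index 0 with right padding
print(cls_resid.shape)  # torch.Size([2, 768])
```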
Decoder
For decoder-style models (e.g. GPT-2) we want to autoregressively generate the next token based on the final token in the sequence.
"batch seq d_model"]
residual_output: Float[Tensor, = residual[:, -1, :] # extract residual of final token (to predict the next one)
final_tok_resid = W_U @ final_tok_resid logits
Padding before the BOS token ([CLS]) allows our model to ignore the variable-length padding when autoregressively generating an output:

```
[PAD] [PAD] [CLS] hello world -> (next token prediction)
```
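And here is a sketch of the decoder case end to end: batched generation with GPT-2 using the left-padded tokenizer from the setup above. The resize_token_embeddings call and the generation settings are my additions, not part of the original post:

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # account for the added [PAD]/[CLS] tokens

tokenizer.padding_side = "left"
batch = tokenizer(prompts, padding="longest", return_tensors="pt")

with torch.no_grad():
    generated = model.generate(
        input_ids=batch.input_ids,
        attention_mask=batch.attention_mask,  # tells the model which positions are padding
        max_new_tokens=5,
        pad_token_id=tokenizer.pad_token_id,
    )

# keep only the newly generated tokens
new_tokens = generated[:, batch.input_ids.shape[1]:]
print(tokenizer.batch_decode(new_tokens))
```

Because the newly added [PAD]/[CLS] embeddings are randomly initialised, the completions from this untuned setup will not be meaningful; the point is only that with left padding every row ends on a real token, so generation continues from the right place.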