πŸ‘Ό Simple Transformer

Author: Eshan Jayasundara
Last Updated: 2nd of March 2025
Created: 28th of February 2025
About:
    └── Single-head transformer (transformer with self-attention, trained with teacher forcing)
Training:
    └── Teacher Forcing (Baseline)
            β”œβ”€β”€ During training, the actual ground-truth tokens (from the dataset) are fed as input to the decoder instead of using the model’s own predictions.
            β”œβ”€β”€ This makes training faster and ensures the model learns accurate token-to-token mappings.
            └── Drawback: At inference time, the model doesn't see ground-truth inputs, so errors can accumulate (called exposure bias).
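A minimal sketch of how a teacher-forcing input/target pair is built from a target sentence (the token strings below are illustrative, not taken from the repository):

```python
# Hypothetical illustration of teacher forcing: the decoder input is the target
# sequence shifted right with <SOS> prepended, and the loss target is the same
# sequence with <EOS> appended.
target_words = ["hello", "i'm", "fine"]

decoder_teacher = ["<SOS>"] + target_words   # fed to the decoder during training
decoder_target  = target_words + ["<EOS>"]   # compared against the decoder logits

print(decoder_teacher)  # ['<SOS>', 'hello', "i'm", 'fine']
print(decoder_target)   # ['hello', "i'm", 'fine', '<EOS>']
```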
Vocabulary dataset (from huggingface):
    └── "yukiarimo/english-vocabulary"
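The dataset can be pulled with the Hugging Face `datasets` library; the split name below is an assumption:

```python
from datasets import load_dataset

# Load the vocabulary dataset from the Hugging Face Hub ("train" split assumed).
vocab_ds = load_dataset("yukiarimo/english-vocabulary", split="train")
print(vocab_ds[0])  # inspect one record to see the available columns
```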
Simple Transformer Architecture:

    Encoder
        β”œβ”€β”€ Input text
        β”‚        └── E.g., "Hello, how are you?"
        β”œβ”€β”€ Remove punctuation from input text
        β”œβ”€β”€ Input tokenization
        β”œβ”€β”€ Embedding lookup with torch.nn.Embedding
        β”œβ”€β”€ Positional encoding (sin, cosine)
        β”œβ”€β”€ Self-attention (see the code sketch after the encoder block)
        β”‚        β”œβ”€β”€ single-head
        β”‚        β”œβ”€β”€ Q = Wq @ Embedding
        β”‚        β”œβ”€β”€ K = Wk @ Embedding
        β”‚        └── V = Wv @ Embedding
        β”œβ”€β”€ Add and norm
        β”œβ”€β”€ Feed forward layer
        β”‚        β”œβ”€β”€ Two linear layers (one hidden layer of size d_ff)
        β”‚        β”œβ”€β”€ ReLU as the activation in the hidden layer
        β”‚        β”œβ”€β”€ No activation at the output layer
        β”‚        └── nn.Linear(in_features=embedding_dim, out_features=d_ff), nn.ReLU(), nn.Linear(in_features=d_ff, out_features=embedding_dim)
        β”œβ”€β”€ Add and norm (again)
        └── Save encoder out to be used in cross attention
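A minimal sketch of the encoder pieces above (sinusoidal positional encoding, single-head self-attention, residual add, and the feed-forward layer); class names and sizes are illustrative, not the repository's actual signatures:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positional_encoding(seq_len: int, embedding_dim: int) -> torch.Tensor:
    """Standard sin/cos positional encoding, shape (seq_len, embedding_dim)."""
    position = torch.arange(seq_len).unsqueeze(1)  # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, embedding_dim, 2) * (-math.log(10000.0) / embedding_dim))
    pe = torch.zeros(seq_len, embedding_dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

class SingleHeadSelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention; Q, K, V all come from the same input."""
    def __init__(self, embedding_dim: int):
        super().__init__()
        self.Wq = nn.Linear(embedding_dim, embedding_dim, bias=False)  # Q = Wq projection of the embedding
        self.Wk = nn.Linear(embedding_dim, embedding_dim, bias=False)  # K = Wk projection of the embedding
        self.Wv = nn.Linear(embedding_dim, embedding_dim, bias=False)  # V = Wv projection of the embedding

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.Wq(x), self.Wk(x), self.Wv(x)              # (seq_len, embedding_dim)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (seq_len, seq_len)
        return torch.softmax(scores, dim=-1) @ v                  # (seq_len, embedding_dim)

# Tiny usage example with made-up sizes (d_ff = 64).
embedding_dim, seq_len = 32, 5
x = torch.randn(seq_len, embedding_dim) + sinusoidal_positional_encoding(seq_len, embedding_dim)
attn_out = SingleHeadSelfAttention(embedding_dim)(x)
ffn = nn.Sequential(nn.Linear(embedding_dim, 64), nn.ReLU(), nn.Linear(64, embedding_dim))
encoder_out = x + attn_out                    # residual add (normalization omitted for brevity)
encoder_out = encoder_out + ffn(encoder_out)  # feed-forward sub-layer with residual add
```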

    Decoder
        β”œβ”€β”€ Decoder teacher text (same as the target text but shifted right)
        β”‚        β”œβ”€β”€ E.g., decoder teacher text - "<SOS> hello, I'm fine."
        β”‚        └── E.g., target text - "hello, I'm fine. <EOS>"
        β”œβ”€β”€ Remove punctuation from the teacher text
        β”œβ”€β”€ Input tokenization
        β”œβ”€β”€ Embedding lookup with torch.nn.Embedding
        β”œβ”€β”€ Positional encoding (sin, cosine)
        β”œβ”€β”€ Masked self-attention (single-head; a new class signature that accepts a causal mask is introduced; see the code sketch after the decoder block)
        β”‚        β”œβ”€β”€ single-head
        β”‚        β”œβ”€β”€ causal mask with triangular matrix
        β”‚        β”œβ”€β”€ Q = Wq @ Embedding
        β”‚        β”œβ”€β”€ K = Wk @ Embedding
        β”‚        └── V = Wv @ Embedding
        β”œβ”€β”€ Add and norm
        β”œβ”€β”€ Cross-attention (the same class signature used for the encoder self-attention can be reused)
        β”‚        β”œβ”€β”€ single-head
        β”‚        β”œβ”€β”€ Q = Wq @ Add and normalized output from masked-self-attention
        β”‚        β”œβ”€β”€ K = Wk @ Encoder output
        β”‚        └── V = Wv @ Encoder output
        β”œβ”€β”€ Add and norm
        β”œβ”€β”€ Feed forward layer
        β”‚        β”œβ”€β”€ Two linear layers (one hidden layer of size d_ff)
        β”‚        β”œβ”€β”€ ReLU as the activation in the hidden layer
        β”‚        β”œβ”€β”€ No activation at the output layer
        β”‚        └── nn.Linear(in_features=embedding_dim, out_features=d_ff), nn.ReLU(), nn.Linear(in_features=d_ff, out_features=embedding_dim)
        β”œβ”€β”€ Add and norm (again)
        └── Linear layer (no activation; the softmax shown in 'Attention Is All You Need' is not applied here, since cross-entropy applies it internally)
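A minimal sketch of the decoder-specific attention above: the causal (triangular) mask and cross-attention over the encoder output. The class below is illustrative; with query == context it acts as masked self-attention, and with the encoder output as context it acts as cross-attention:

```python
import math
from typing import Optional
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    """Single-head attention with an optional mask (illustrative signature)."""
    def __init__(self, embedding_dim: int):
        super().__init__()
        self.Wq = nn.Linear(embedding_dim, embedding_dim, bias=False)
        self.Wk = nn.Linear(embedding_dim, embedding_dim, bias=False)
        self.Wv = nn.Linear(embedding_dim, embedding_dim, bias=False)

    def forward(self, query: torch.Tensor, context: torch.Tensor,
                mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        q, k, v = self.Wq(query), self.Wk(context), self.Wv(context)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))  # block attention to future positions
        return torch.softmax(scores, dim=-1) @ v

# Causal mask: lower-triangular matrix, so position i only attends to positions <= i.
tgt_len, src_len, embedding_dim = 4, 5, 32
causal_mask = torch.tril(torch.ones(tgt_len, tgt_len))

decoder_emb = torch.randn(tgt_len, embedding_dim)   # decoder embeddings + positional encoding
encoder_out = torch.randn(src_len, embedding_dim)   # saved encoder output

# Separate instances, so masked self-attention and cross-attention learn separate weights.
masked_self = SingleHeadAttention(embedding_dim)(decoder_emb, decoder_emb, mask=causal_mask)
cross = SingleHeadAttention(embedding_dim)(masked_self, encoder_out)  # Q from decoder, K/V from encoder
```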
        
    Optimization
        β”œβ”€β”€ Initialize the Adam optimizer with the model’s parameters and a specified learning rate.
        β”‚        └── self.optimizer = torch.optim.Adam(params=self.parameters(), lr=learning_rate)
        β”œβ”€β”€ Before computing gradients for the current batch, we reset any existing gradients from the previous iteration.
        β”‚        └── self.optimizer.zero_grad()
        β”œβ”€β”€ The model takes in `input_tokens` and `decoder_teacher_tokens` and performs a forward pass to compute `logits`
        β”‚        └── logits = self.forward(input_tokens, decoder_teacher_tokens)
        β”œβ”€β”€ The cross-entropy loss
        β”‚        β”œβ”€β”€ Measures the difference between the predicted token distribution (logits) and the actual target tokens (decoder_target_tokens).
        β”‚        β”œβ”€β”€ It expects logits to have raw scores (not probabilities), and it applies softmax internally.
        β”‚        └── loss = F.cross_entropy(logits, decoder_target_tokens)
        β”œβ”€β”€ Compute the gradients of the loss with respect to all trainable parameters in the model using automatic differentiation (backpropagation).
        β”‚        └── loss.backward()
        └── Optimizer updates the model's weights using the computed gradients.
                 └── self.optimizer.step()
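A minimal sketch of one optimization step as described above, written as standalone helpers; the `model` interface (logits of shape (target_length, vocab_size) from 1-D token tensors) is assumed:

```python
import torch
import torch.nn.functional as F

def make_optimizer(model, learning_rate=1e-3):
    """Adam over all trainable parameters (learning rate is an example value)."""
    return torch.optim.Adam(params=model.parameters(), lr=learning_rate)

def train_step(model, optimizer, input_tokens, decoder_teacher_tokens, decoder_target_tokens):
    """One teacher-forced optimization step."""
    optimizer.zero_grad()                                  # reset gradients from the previous iteration
    logits = model(input_tokens, decoder_teacher_tokens)   # forward pass with teacher forcing
    loss = F.cross_entropy(logits, decoder_target_tokens)  # softmax is applied internally
    loss.backward()                                        # backpropagation
    optimizer.step()                                       # weight update
    return loss.item()
```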
    
    After training, output tokens (and then text) are produced with autoregressive generation, one token at a time (a code sketch follows the illustration below)
        β”œβ”€β”€ Start with <SOS> as the initial input to the decoder, while the input to the encoder is the `prompt`.
        β”œβ”€β”€ Model predicts the next token.
        β”œβ”€β”€ Append the predicted token to the sequence.
        β”œβ”€β”€ Repeat until an <EOS> token or max length is reached.
        └── For illustration, let's use words instead of tokens (their numerical representations)
                <SOS>
                <SOS> hello
                <SOS> hello I'm
                <SOS> hello I'm good
                <SOS> hello I'm good <EOS>
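A minimal sketch of the greedy autoregressive loop above, assuming the same `model` interface as the training sketch and hypothetical `sos_id`/`eos_id` token ids:

```python
import torch

def generate(model, input_tokens, sos_id, eos_id, max_len=50):
    """Greedy autoregressive decoding: feed the tokens generated so far back into
    the decoder, take the arg-max token, and stop at <EOS> or max_len."""
    generated = [sos_id]                                   # start with <SOS>
    with torch.no_grad():
        for _ in range(max_len):
            decoder_tokens = torch.tensor(generated)       # sequence generated so far
            logits = model(input_tokens, decoder_tokens)   # assumed shape (current_length, vocab_size)
            next_token = int(logits[-1].argmax())          # most likely next token
            generated.append(next_token)
            if next_token == eos_id:                       # stop at <EOS>
                break
    return generated
```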
Future Improvements:
    β”œβ”€β”€ Multi-head attention instead of single-head attention.
    β”œβ”€β”€ Layer normalization instead of simple mean-variance normalization.
    └── Dropout layers for better generalization.
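For these improvements, PyTorch's built-in modules could replace the hand-rolled pieces; a hedged sketch with example sizes:

```python
import torch
import torch.nn as nn

embedding_dim, num_heads, seq_len = 32, 4, 5

mha = nn.MultiheadAttention(embed_dim=embedding_dim, num_heads=num_heads, batch_first=True)
norm = nn.LayerNorm(embedding_dim)   # layer normalization instead of plain mean-variance normalization
drop = nn.Dropout(p=0.1)             # dropout for better generalization

x = torch.randn(1, seq_len, embedding_dim)  # (batch, seq_len, embedding_dim)
attn_out, _ = mha(x, x, x)                  # multi-head self-attention: query = key = value = x
out = norm(x + drop(attn_out))              # residual + dropout + layer norm
```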