In Part 4, I focused on attention and built reusable modules that mirror transformer internals. In Part 5, I assembled the complete GPT architecture at medium scale, validated shapes and memory, and ran a first text generation. The outputs are gibberish because the model is untrained, which is expected.

The goal here is to make sure the architecture is sound and the end-to-end pipeline works. In Part 6, I will pre-train the model.

Model configuration and setup

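This configuration targets a GPT-2 medium scale model: 24 layers, 16 heads, 1,024-dimensional embeddings, and a 1,024-token context. Unlike the original GPT-2, the query, key, and value projections here have no bias terms (qkv_bias is False).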

SYDSGPT_CONFIG_345M = {
    "vocab_size" : 50257,
    "context_length" : 1024,
    "embedding_dim" : 1024,
    "num_heads" : 16,
    "num_layers" : 24,
    "dropout" : 0.1,
    "qkv_bias" : False
}

  • Vocabulary size: 50,257
  • Context length: 1,024
  • Embedding dim: 1,024
  • Heads: 16
  • Layers: 24
  • Dropout: 0.1
  • QKV bias: False
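
Two quantities follow directly from these values and are worth checking before building anything: the per-head dimension and the feed-forward hidden width. The sketch below repeats the configuration so it runs on its own; the names head_dim and ffn_hidden_dim are illustrative and not part of the original config.

```python
# Copy of the configuration above, repeated so the snippet is self-contained.
config = {
    "vocab_size": 50257,
    "context_length": 1024,
    "embedding_dim": 1024,
    "num_heads": 16,
    "num_layers": 24,
    "dropout": 0.1,
    "qkv_bias": False,
}

# Multi-head attention splits the embedding evenly across heads,
# so the embedding dimension must be divisible by the head count.
assert config["embedding_dim"] % config["num_heads"] == 0
head_dim = config["embedding_dim"] // config["num_heads"]

# The feed-forward network in this post expands the embedding by a factor of 4.
ffn_hidden_dim = 4 * config["embedding_dim"]

print(f"Per-head dimension: {head_dim}")      # 64
print(f"FFN hidden width: {ffn_hidden_dim}")  # 4096
```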
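
Placeholder GPT for structure validation

I begin with a skeleton GPT to validate the computation graph and shapes. It wires together the token and position embeddings, a stack of transformer blocks, a final layer norm, and the output projection. The transformer block and layer norm are identity stubs for now.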

import torch
import torch.nn as nn

class PlaceholderGPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embedding = nn.Embedding(config["vocab_size"], config["embedding_dim"])
        self.position_embedding = nn.Embedding(config["context_length"], config["embedding_dim"])
        self.dropout = nn.Dropout(config["dropout"])
        self.transformer_blocks = nn.Sequential(*[PlaceholderTransformerBlock(config) for _ in range(config["num_layers"])])
        self.final_layer_norm = PlaceholderLayerNorm(config["embedding_dim"])
        self.output_projection = nn.Linear(config["embedding_dim"], config["vocab_size"], bias = False)

    def forward(self, input):
        batch_size, seq_length = input.shape
        token_embeddings = self.token_embedding(input)
        position_embeddings = self.position_embedding(torch.arange(seq_length, device = input.device))
        x = token_embeddings + position_embeddings
        x = self.dropout(x)
        x = self.transformer_blocks(x)
        x = self.final_layer_norm(x)
        logits = self.output_projection(x)
        return logits
    
class PlaceholderTransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        
    def forward(self, x):
        return x
    
class PlaceholderLayerNorm(nn.Module):
    def __init__(self, embedding_dim, eps = 1e-5):
        super().__init__()
                
    def forward(self, x):
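        return x

Tokenization and forward pass sanity check

I validate that the forward pass produces logits of shape (batch, seq_len, vocab_size) using the GPT-2 tokenizer. Both prompts encode to four tokens, so torch.stack can combine them into one batch without padding.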

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
batch = []
ex1 = "Hello how are you"
ex2 = "What are you doing"
batch.append(torch.tensor(tokenizer.encode(ex1)))
batch.append(torch.tensor(tokenizer.encode(ex2)))
batch = torch.stack(batch, dim = 0)
print(batch)

torch.manual_seed(246)
test_model = PlaceholderGPT(SYDSGPT_CONFIG_345M)
logits = test_model(batch)
print(f"Logits Shape: {logits.shape}")
print(f"Logits: \n {logits}")

Expected shape:

Logits Shape: torch.Size([2, 4, 50257])

This confirms correct wiring from embeddings to output projection.
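
One detail in the forward pass is easy to miss: the token embeddings have shape (batch, seq_len, embedding_dim) while the position embeddings have shape (seq_len, embedding_dim), and the addition relies on broadcasting. A minimal sketch with small, made-up sizes (not the real configuration) shows what happens:

```python
import torch

torch.manual_seed(246)
batch_size, seq_length, embedding_dim = 2, 4, 8  # small illustrative sizes

token_embeddings = torch.randn(batch_size, seq_length, embedding_dim)
position_embeddings = torch.randn(seq_length, embedding_dim)

# Broadcasting treats the position embeddings as if they had a leading
# batch axis, so the same offsets are added to every sample.
x = token_embeddings + position_embeddings
print(x.shape)  # torch.Size([2, 4, 8])

# Both samples received identical positional offsets.
same_offsets = torch.allclose(
    x[0] - token_embeddings[0], x[1] - token_embeddings[1], atol=1e-5
)
print(f"Same offsets for every sample: {same_offsets}")
```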

Manual layer normalization demonstration

Before using a custom class, I show manual normalization to verify behavior.

torch.manual_seed(246)
ex_batch = torch.randn(2,6)
nn_layer = nn.Sequential(nn.Linear(6,8), nn.ReLU())
output = nn_layer(ex_batch)
print(f"Output Shape: {output.shape}")
print(f"Output: \n {output}")

mean = output.mean(dim = -1, keepdim = True)
variance = output.var(dim = -1, keepdim = True)
print(f"Mean: \n {mean}")
print(f"Variance: \n {variance}")

normalized_output = (output - mean) / torch.sqrt(variance)
mean_after_norm = normalized_output.mean(dim = -1, keepdim = True)
variance_after_norm = normalized_output.var(dim = -1, keepdim = True)
print(f"Normalized Output: \n {normalized_output}")
print(f"Mean After Norm: \n {mean_after_norm}")
print(f"Variance After Norm: \n {variance_after_norm}")

  • Observation: After normalization, each sample's mean is close to zero and its variance is close to one.
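
A note on the variance estimator: torch.var divides by n - 1 by default, while layer normalization conventionally divides by n. The standard-library sketch below uses one made-up row of post-ReLU activations to show how the two estimators relate; the values are illustrative only.

```python
import math
import statistics

row = [0.0, 0.5, 1.5, 0.0, 2.0, 0.25, 0.0, 1.0]  # made-up activations, n = 8

mean = statistics.fmean(row)
sample_var = statistics.variance(row)       # divides by n - 1 (torch.var default)
population_var = statistics.pvariance(row)  # divides by n (unbiased = False)

# Normalize with the sample variance, as the manual demonstration does.
normalized = [(value - mean) / math.sqrt(sample_var) for value in row]

norm_mean = statistics.fmean(normalized)
norm_sample_var = statistics.variance(normalized)
norm_population_var = statistics.pvariance(normalized)

print(f"Mean after norm: {norm_mean:.6f}")                      # ~0
print(f"Sample variance after norm: {norm_sample_var:.6f}")     # ~1
print(f"Population variance after norm: {norm_population_var:.6f}")  # (n - 1) / n = 0.875
```

The variance comes out as one only when it is measured with the same estimator that was used to normalize. Also note that a row ReLU has zeroed out entirely has zero variance, so the division is undefined without a small epsilon in the denominator.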
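
Custom LayerNorm implementation and usage

This LayerNorm mirrors the behavior of PyTorch's built-in nn.LayerNorm but is implemented from scratch for clarity. It uses the biased variance (unbiased = False) and adds eps inside the square root for numerical stability.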

class LayerNorm(nn.Module):
    def __init__(self, embedding_dim, eps = 1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(embedding_dim))
        self.shift = nn.Parameter(torch.zeros(embedding_dim))

    def forward(self, x):
        mean = x.mean(dim = -1, keepdim = True)
        variance = x.var(dim = -1, keepdim = True, unbiased = False)
        normalized_x = (x - mean) / torch.sqrt(variance + self.eps)
        return self.scale * normalized_x + self.shift

Usage example:

layer_norm = LayerNorm(embedding_dim = 6)
normalized_output = layer_norm(ex_batch)
print(f"Layer Norm Output: \n {normalized_output}")
mean_after_norm = normalized_output.mean(dim = -1, keepdim = True)
variance_after_norm = normalized_output.var(dim = -1, keepdim = True, unbiased = False)
print(f"Mean After Layer Norm: \n {mean_after_norm}")
print(f"Variance After Layer Norm: \n {variance_after_norm}")

  • Observation: With scale initialized to ones and shift to zeros, the output has mean near zero and variance near one. Both parameters are learnable, so training can move the output away from that.
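
One way to check the claim that this class mirrors the built-in is to compare both on the same input. The sketch below repeats the class so it is self-contained and assumes a freshly constructed nn.LayerNorm, whose weight starts at ones and bias at zeros.

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    # Same implementation as above, repeated so the snippet runs on its own.
    def __init__(self, embedding_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(embedding_dim))
        self.shift = nn.Parameter(torch.zeros(embedding_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        variance = x.var(dim=-1, keepdim=True, unbiased=False)
        normalized_x = (x - mean) / torch.sqrt(variance + self.eps)
        return self.scale * normalized_x + self.shift

torch.manual_seed(246)
x = torch.randn(2, 6)

custom_out = LayerNorm(6)(x)
builtin_out = nn.LayerNorm(6, eps=1e-5)(x)  # fresh module: weight = 1, bias = 0

# Small float32 differences are expected, hence the tolerance.
matches = torch.allclose(custom_out, builtin_out, atol=1e-5)
print(f"Matches nn.LayerNorm: {matches}")
```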
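
GELU activation from first principles

GELU is the activation used in GPT-2's feed-forward layers. The class below implements its tanh approximation rather than the exact form based on the Gaussian CDF.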

class GELU(nn.Module):
    def __init__(self):
        super().__init__()
    
    def forward(self, x):
        return 0.5 * x *(1 + torch.tanh(torch.sqrt(torch.tensor(2.0 / torch.pi)) * (x + 0.044715 * torch.pow(x, 3))))
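
To see how close the tanh form is to the exact definition, GELU(x) = x * Phi(x) with Phi the standard normal CDF, here is a small standard-library comparison. The function names gelu_exact and gelu_tanh are illustrative and not part of the model code.

```python
import math

def gelu_exact(x):
    # x * Phi(x), with the standard normal CDF written via the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The tanh approximation, same constants as the GELU class above.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Scan [-5, 5] in steps of 0.01 and record the largest gap.
xs = [i / 100.0 for i in range(-500, 501)]
max_abs_error = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"Max |exact - tanh| on [-5, 5]: {max_abs_error:.2e}")
```

The gap stays well below typical activation magnitudes, which is why the approximation is commonly used.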

FeedForward network construction and run

The transformer FFN expands the embedding dimension by a factor of four, applies the GELU nonlinearity, and projects back to the original width.

class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(config["embedding_dim"], 4 * config["embedding_dim"]),
            GELU(),
            nn.Linear(4 * config["embedding_dim"], config["embedding_dim"])
        )
    
    def forward(self, x):
        return self.layers(x)

Example usage:

feed_forward = FeedForward(SYDSGPT_CONFIG_345M)
example_input = torch.randn(2, 6, SYDSGPT_CONFIG_345M["embedding_dim"])
output = feed_forward(example_input)
print(f"Feed Forward Output Shape: {output.shape}")
print(f"Feed Forward Output: \n {output}")

Observation: Output shape preserves (batch, seq_len, embedding_dim).

Residual connections and gradient flow

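Residuals are essential for training deep networks. This experiment compares the mean absolute weight gradient of each layer in a small six-layer network, with and without shortcuts. A shortcut is applied only where the input and output shapes match, so the final 3-to-1 layer never gets one.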

class ResidualConnectionsTestNN(nn.Module):
    def __init__(self, layer_dims, use_shortcuts):
        super().__init__()
        self.use_shortcuts = use_shortcuts
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Linear(layer_dims[0], layer_dims[1]), GELU()),
            nn.Sequential(nn.Linear(layer_dims[1], layer_dims[2]), GELU()),
            nn.Sequential(nn.Linear(layer_dims[2], layer_dims[3]), GELU()),
            nn.Sequential(nn.Linear(layer_dims[3], layer_dims[4]), GELU()),
            nn.Sequential(nn.Linear(layer_dims[4], layer_dims[5]), GELU()),
            nn.Sequential(nn.Linear(layer_dims[5], layer_dims[6]), GELU())
        ])

    def forward(self, x):
        for layer in self.layers:
            layer_output = layer(x)
            if self.use_shortcuts and x.shape == layer_output.shape:
                x = x + layer_output
            else:
                x = layer_output
        return x

def get_gradients(model, input):
    output = model(input)
    target = torch.tensor([[0.]])
    loss_function = nn.MSELoss()
    loss = loss_function(output, target)
    loss.backward()
    for name, param in model.named_parameters():
        if 'weight' in name:
            print(f"Mean of Gradients for {name}: {param.grad.abs().mean().item()}")

layer_dims = [3, 3, 3, 3, 3, 3, 1]
input = torch.tensor([[-1., 0., 1.]])

torch.manual_seed(246)
model_without_residuals = ResidualConnectionsTestNN(layer_dims, use_shortcuts = False)
torch.manual_seed(246)
model_with_residuals = ResidualConnectionsTestNN(layer_dims, use_shortcuts = True)

print("Gradients without Residual Connections:")
get_gradients(model_without_residuals, input)

print("Gradients with Residual Connections:")
get_gradients(model_with_residuals, input)

Observation: With residual connections, the gradients reaching the early layers are larger and more even across layers, which helps keep deep stacks trainable.
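
The reason can be seen with a toy scalar calculation. Without a shortcut, the gradient through a chain of layers is the product of the local derivatives. With a shortcut, each layer computes x + f(x), so each factor becomes 1 + f'(x). The sketch below assumes a hypothetical local derivative of 0.1 for every layer and five layers, matching the five same-width layers in the experiment; real networks will not be this uniform.

```python
num_layers = 5
local_derivative = 0.1  # hypothetical slope of each layer's transformation

# Plain chain: d/dx f(f(...f(x))) is a product of small factors.
plain_gradient = local_derivative ** num_layers

# Residual chain: d/dx (x + f(x)) = 1 + f'(x) at every layer.
residual_gradient = (1.0 + local_derivative) ** num_layers

print(f"Without shortcuts: {plain_gradient:.6f}")     # 0.000010
print(f"With shortcuts:    {residual_gradient:.6f}")  # 1.610510
```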

Transformer block with attention, FFN, layer norm, dropout, and residuals

A single transformer block combines everything into the standard pre-norm residual structure.

from attention.MultiHeadAttention import MultiHeadAttention

class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attention = MultiHeadAttention(
            input_dim = config["embedding_dim"],
            output_dim = config["embedding_dim"],
            dropout = config["dropout"],
            context_length = config["context_length"],
            num_heads = config["num_heads"],
            qkv_bias = config["qkv_bias"])
        self.layer_norm1 = LayerNorm(config["embedding_dim"])
        self.feed_forward = FeedForward(config)
        self.layer_norm2 = LayerNorm(config["embedding_dim"])
        self.dropout = nn.Dropout(config["dropout"])

    def forward(self, x):
        shortcut = x
        x = self.layer_norm1(x)
        x = self.attention(x)
        x = self.dropout(x)
        x = x + shortcut
        shortcut = x
        x = self.layer_norm2(x)
        x = self.feed_forward(x)
        x = self.dropout(x)
        x = x + shortcut
        return x

Running the block:

torch.manual_seed(246)
transformer = TransformerBlock(SYDSGPT_CONFIG_345M)
example_input = torch.randn(2, 6, SYDSGPT_CONFIG_345M["embedding_dim"])
output = transformer(example_input)
print(f"Example Input Shape: {example_input.shape}")
print(f"Transformer Block Output Shape: {output.shape}")
print(f"Transformer Block Output: \n {output}")

Observation: The block preserves shape (batch, seq_len, embedding_dim).

Full SydsGPT model assembly

The complete GPT model stacks multiple transformer blocks and adds the embeddings, a final layer norm, and the output projection.

class SydsGPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embedding = nn.Embedding(config["vocab_size"], config["embedding_dim"])
        self.position_embedding = nn.Embedding(config["context_length"], config["embedding_dim"])
        self.drop_embedding = nn.Dropout(config["dropout"])
        self.transformer_blocks = nn.Sequential(*[TransformerBlock(config) for _ in range(config["num_layers"])])
        self.final_layer_norm = LayerNorm(config["embedding_dim"])
        self.output_projection = nn.Linear(config["embedding_dim"], config["vocab_size"], bias = False)
    
    def forward(self, input):
        batch_size, seq_length = input.shape
        token_embeddings = self.token_embedding(input)
        position_embeddings = self.position_embedding(torch.arange(seq_length, device=input.device))
        x = token_embeddings + position_embeddings
        x = self.drop_embedding(x)
        x = self.transformer_blocks(x)
        x = self.final_layer_norm(x)
        logits = self.output_projection(x)
        return logits

Forward pass on tokenized inputs

torch.manual_seed(246)
sydsgpt_model = SydsGPT(SYDSGPT_CONFIG_345M)
logits = sydsgpt_model(batch)
print(f"Input: {batch}")
print(f"Logits Shape: {logits.shape}")
print(f"Logits: {logits}")

Observation: Logits shape (batch, seq_len, vocab) matches expectations.

Parameter count and memory footprint

I compute the total number of trainable parameters and estimate the memory needed to store them in float32 (4 bytes per parameter).

total_parameters = sum(parameter.numel() for parameter in sydsgpt_model.parameters())
print(f"Total Parameters in SydsGPT Model: {total_parameters}")

total_size_bytes = total_parameters * 4
total_size_mb = total_size_bytes / (1024 ** 2)
print(f"Total Model Size: {total_size_mb:.2f} MB")

  • Result: 406,212,608 parameters
  • Size: ~1,549.58 MB in float32 for parameters alone

This excludes activations, gradients, and optimizer states.
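
The total can be reproduced by hand from the configuration. The breakdown below is a sketch under one assumption about the Part 4 MultiHeadAttention module, which is not shown here: bias-free query, key, and value projections plus an output projection with a bias. That assumption reproduces the reported count exactly.

```python
vocab_size, context_length = 50257, 1024
d, num_layers = 1024, 24

token_embedding = vocab_size * d
position_embedding = context_length * d

# Assumption: Q, K, V projections have no bias (qkv_bias = False) and the
# attention output projection has a weight matrix plus a bias vector.
attention = 3 * d * d + (d * d + d)
# Two linear layers with biases: d -> 4d and 4d -> d.
feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)
# Two LayerNorms per block, each with a scale and a shift vector.
layer_norms = 2 * (2 * d)
block = attention + feed_forward + layer_norms

final_layer_norm = 2 * d
output_projection = d * vocab_size  # bias = False

total = (token_embedding + position_embedding + num_layers * block
         + final_layer_norm + output_projection)
size_mb = total * 4 / 1024 ** 2

print(f"Per block: {block:,}")       # 12,593,152
print(f"Total:     {total:,}")       # 406,212,608
print(f"Size:      {size_mb:.2f} MB")  # 1549.58 MB

# If the output projection reused the token embedding matrix (weight tying),
# those weights would be counted once instead of twice.
tied_total = total - output_projection
print(f"With a tied output projection: {tied_total:,}")  # 354,749,440
```

GPT-2 shares the token embedding matrix with the output projection, while SydsGPT keeps them separate. That likely accounts for most of the gap between the 406M count here and the "345M" in the configuration name.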

Greedy text generation loop

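This is a simple autoregressive loop that uses argmax (greedy) decoding. It crops the input to the context window and extends the sequence one token at a time. Note that max_length is the number of new tokens to generate, not the total sequence length.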

def generate_simple(model, input_ids, max_length, context_size):
    for _ in range(max_length):
        input_ids_crop = input_ids[:, -context_size:]
        with torch.no_grad():
            logits  = model(input_ids_crop)
        next_token_logits = logits[:, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim = -1)
        next_token = torch.argmax(next_token_probs, dim = -1, keepdim = True)
        input_ids = torch.cat((input_ids, next_token), dim = 1)
    return input_ids

Use it end to end:

start_context = "Once upon a time"
encoded_context = tokenizer.encode(start_context)
input_ids = torch.tensor(encoded_context).unsqueeze(0)
print(f"Encoded Context: {encoded_context}")
print(f"Input IDs Shape: {input_ids.shape}")

sydsgpt_model.eval()

context_size = SYDSGPT_CONFIG_345M["context_length"]
generated_ids = generate_simple(sydsgpt_model, input_ids, 10, context_size)
print(f"Generated IDs: {generated_ids}")

generated_text = tokenizer.decode(generated_ids.squeeze(0).tolist())
print(f"Generated Text: {generated_text}")

Observation: The output text is gibberish. This is expected since the model is untrained. The goal is verifying the generation loop and context handling.
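
Two details of the loop can be checked in isolation with plain Python. First, softmax preserves the ordering of the logits, so greedy decoding picks the same token whether argmax is applied to the logits or to the probabilities. Second, the negative slice keeps only the most recent tokens and leaves shorter inputs untouched. The helper names and numbers below are made up for illustration.

```python
import math

def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

logits = [1.2, -0.3, 4.1, 0.7]  # made-up next-token logits
probs = softmax(logits)
print(argmax(logits), argmax(probs))  # 2 2

# Cropping keeps only the last context_size tokens.
token_ids = [10, 11, 12, 13, 14, 15]
context_size = 4
cropped_long = token_ids[-context_size:]
cropped_short = token_ids[:3][-context_size:]
print(cropped_long)   # [12, 13, 14, 15]
print(cropped_short)  # [10, 11, 12]
```

The softmax call in generate_simple is therefore optional for greedy decoding, though it becomes necessary once tokens are sampled from the distribution.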

What I validated in Part 5

  • Architecture completeness: Embeddings, transformer blocks, normalization, residuals, and projection are wired correctly.
  • Shape discipline: Every component preserves the expected shapes through the stack.
  • Parameter scale and memory: The model lands at ~406M parameters, or roughly 1,550 MB of float32 weights.
  • Generation pipeline: Tokenization, forward pass, logits to next token, and sequence extension work as expected.
  • Stability building blocks: LayerNorm, GELU, FFN, and residual connections are correctly implemented and behaving as intended.

Try It Yourself

The full notebook with all the steps, from basic attention to multi‑head attention, is available here:

SydsGPT notebook Repository

Clone the repo, open the Jupyter notebook, and step through the code. You can experiment with different numbers of heads, embedding dimensions, and masking strategies to see how they affect the outputs.

Build It Yourself

If you want to try building it yourself, you can find the complete code with detailed explanations of each block in the source code section at the end of this post. All the best!

What’s next

Part 6 is pre-training. I will set up the training loop with tokenized corpora, define the loss function for next-token prediction, implement batching at scale, and start training with a robust optimizer. I may also switch to mixed precision to reduce memory and speed up training. The goal is to move from gibberish to coherent text by optimizing on a meaningful dataset.

Source Code

sydsgpt-nb
