In Part 5, I assembled the complete GPT medium model and validated its architecture with forward passes and text generation. In Part 6, I moved into the crucial stage of pretraining. I set out to understand the basics of pretraining by building a complete, reproducible pipeline around a GPT‑2 style model I call SydsGPT. In this part, I pretrained the model on 20 books from Project Gutenberg and used it to generate grammatically correct text. The focus was not scale, but clarity, control, and the groundwork for a private assistant that can operate without privacy concerns.

Why small models? Because I cannot compete with AI labs that train giant models on thousands of GPUs. My aim is to train a compact language model of around 200M parameters on approximately 3B tokens, fine‑tune it for domain‑specific tasks, and extend it with tool calling for web search and RAG to interact with private data. This journey is about exploring the art of the possible with small models.

Configuring and instantiating a GPT‑2 345M style model

I started by defining SydsGPT with a GPT‑2 345M‑like configuration. The setup includes vocabulary size, context length, embedding dimension, number of heads and layers, dropout, and whether to include QKV biases. I fixed a manual seed for reproducibility, moved the model to the available device (CUDA when present, otherwise CPU), and switched it to eval mode, which disables dropout so that forward passes are deterministic.

import torch
from model.SydsGPT import SydsGPT

SYDSGPT_CONFIG_345M = {
    "vocab_size" : 50257,
    "context_length" : 512,
    "embedding_dim" : 1024,
    "num_heads" : 16,
    "num_layers" : 24,
    "dropout" : 0.1,
    "qkv_bias" : False
}

torch.manual_seed(246)
model = SydsGPT(SYDSGPT_CONFIG_345M)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()
  • Inputs: Integer token IDs shaped (batch size, sequence length)
  • Outputs: Logits shaped (batch size, sequence length, vocab size)
  • Device tips: Use GPU if available; consider bfloat16/float16 for inference where supported
  • Common pitfalls: Ensure model/ and modules/ are importable, num_heads divides embedding_dim, do not call model.train() during inference
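
As a rough cross-check on the "345M" label, the parameter count can be estimated from the config alone. The sketch below assumes a standard GPT‑2 block layout (fused QKV projection, output projection with bias, a 4x MLP expansion, two LayerNorms per block, learned positional embeddings). The actual SydsGPT modules are not shown in this post, so `estimate_params` and the layout it encodes are assumptions, not the model's real accounting.

```python
def estimate_params(cfg, tie_output_head=False):
    """Estimate the parameter count of a GPT-2 style decoder (assumed layout)."""
    d = cfg["embedding_dim"]
    vocab = cfg["vocab_size"]
    # Each attention head needs an integer-sized slice of the embedding.
    assert d % cfg["num_heads"] == 0, "num_heads must divide embedding_dim"

    token_emb = vocab * d
    pos_emb = cfg["context_length"] * d

    qkv = 3 * d * d + (3 * d if cfg["qkv_bias"] else 0)   # Q, K, V projections
    attn_out = d * d + d                                   # output projection + bias
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)            # expand, then project back
    layer_norms = 2 * 2 * d                                # scale + shift, twice per block
    per_block = qkv + attn_out + mlp + layer_norms

    final_norm = 2 * d
    head = 0 if tie_output_head else vocab * d             # untied head has its own matrix
    return token_emb + pos_emb + cfg["num_layers"] * per_block + final_norm + head


config = {
    "vocab_size": 50257,
    "context_length": 512,
    "embedding_dim": 1024,
    "num_heads": 16,
    "num_layers": 24,
    "dropout": 0.1,
    "qkv_bias": False,
}

untied = estimate_params(config)
tied = estimate_params(config, tie_output_head=True)
print(f"head_dim = {config['embedding_dim'] // config['num_heads']}")
print(f"untied output head: {untied / 1e6:.1f}M parameters")
print(f"tied output head:   {tied / 1e6:.1f}M parameters")
```

Under these assumptions the tied count comes out at roughly 354M, and the untied count is larger by exactly vocab_size × embedding_dim. Whether SydsGPT ties its output head to the token embedding is not stated here.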

Minimal text generation with GPT‑2 BPE

To verify the model and tokenizer integration, I built a thin encode/decode wrapper with tiktoken and a simple generation loop. This confirmed end‑to‑end functionality.

from modules.GenerateSimple import generate_simple
import tiktoken

def text_to_tokens(text, tokenizer):
    tokens = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
    return tokens

def tokens_to_text(tokens, tokenizer):
    text = tokenizer.decode(tokens.squeeze(0).tolist())
    return text

tokenizer = tiktoken.get_encoding("gpt2")

Then I ran a small sample:

input_text = "Once upon a time"
input_tokens = text_to_tokens(input_text, tokenizer).to(device)  # keep inputs on the same device as the model
output_tokens = generate_simple(model, input_tokens, 100, SYDSGPT_CONFIG_345M['context_length'])
output_text = tokens_to_text(output_tokens, tokenizer)

print(f"Output Text: {output_text}")

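Outcome: As expected for a randomly initialized, untrained model, the continuation was mostly gibberish. The point of this step was only to confirm that encoding, generation, and decoding work end to end.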
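
The internals of `generate_simple` are not shown in this post. A minimal sketch of what a greedy loop of this kind typically does, written in plain Python with a stand-in model so it runs without PyTorch, might look like this. `toy_model` and `generate_greedy` are hypothetical names used only for illustration.

```python
def toy_model(context):
    """Stand-in for a language model: returns logits over a 5-token vocabulary.

    It always favors the token after the last one (mod 5), so output is predictable.
    """
    vocab_size = 5
    favored = (context[-1] + 1) % vocab_size
    return [1.0 if i == favored else 0.0 for i in range(vocab_size)]


def generate_greedy(model, tokens, max_new_tokens, context_size):
    """Append max_new_tokens tokens, always picking the highest-scoring one."""
    tokens = list(tokens)
    for _ in range(max_new_tokens):
        context = tokens[-context_size:]   # crop to the model's context window
        logits = model(context)            # scores for the next token
        next_token = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
        tokens.append(next_token)
    return tokens


generated = generate_greedy(toy_model, [0, 1], max_new_tokens=6, context_size=4)
print(generated)
```

The two ideas to take from it are the context crop (only the last `context_size` tokens are fed back in) and the append-and-repeat structure. A real model would return logits for every position, and the loop would use only the last one.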

Batch inference and greedy selection

Next I tested batched inference on two prompts. I computed logits, converted them to probabilities, and selected the most likely tokens via greedy argmax to get a quick qualitative signal.

example_input_text_1 = "The quick brown fox"
example_target_text_1 = " quick brown fox jumps"

example_input_text_2 = "In a galaxy far"
example_target_text_2 = " a galaxy far away"

input_tokens_1 = text_to_tokens(example_input_text_1, tokenizer)
input_tokens_2 = text_to_tokens(example_input_text_2, tokenizer)

target_tokens_1 = text_to_tokens(example_target_text_1, tokenizer)
target_tokens_2 = text_to_tokens(example_target_text_2, tokenizer)

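# Move both batches to the model's device so the forward pass and the loss computation work on GPU
batch_input_tokens = torch.cat([input_tokens_1, input_tokens_2], dim=0).to(device)
batch_target_tokens = torch.cat([target_tokens_1, target_tokens_2], dim=0).to(device)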

print(f"Batch Input Tokens Shape: {batch_input_tokens.shape}")
print(f"Batch Input Tokens: {batch_input_tokens}")
print(f"Batch Target Tokens Shape: {batch_target_tokens.shape}")
print(f"Batch Target Tokens: {batch_target_tokens}")

with torch.no_grad():
    logits = model(batch_input_tokens)
probs = torch.softmax(logits, dim = -1)

generated_tokens = torch.argmax(probs, dim = -1, keepdim = True)
print(f"Generated Tokens: \n{generated_tokens}")

print(f"Target text for example 1: {example_target_text_1}")
print(f"Generated text for example 1: {tokens_to_text(generated_tokens[0].flatten(), tokenizer)}")
print(f"Target text for example 2: {example_target_text_2}")
print(f"Generated text for example 2: {tokens_to_text(generated_tokens[1].flatten(), tokenizer)}")

Goal: Validate shapes, decoding, and baseline behavior under greedy prediction.

Selecting target probabilities and estimating a simple loss

I extracted probabilities for target tokens using advanced indexing, converted them to log‑probs, and averaged the negative log‑probability as a proxy for loss. This was a hands‑on way to inspect model confidence across positions.

batch_index = 0
target_probs_1 = probs[batch_index, [0,1,2,3], batch_target_tokens[batch_index]]
print(f"Target probabilities for example 1: {target_probs_1}")

batch_index = 1
target_probs_2 = probs[batch_index, [0,1,2,3], batch_target_tokens[batch_index]]
print(f"Target probabilities for example 2: {target_probs_2}")

log_probs = torch.log(torch.cat((target_probs_1, target_probs_2)))
print(f"Log probabilities: {log_probs}")

mean_log_probs = torch.mean(log_probs)
print(f"Mean log probability: {mean_log_probs}")

negative_mean_log_probs = mean_log_probs * -1
print(f"Negative mean log probability (loss): {negative_mean_log_probs}")

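Note: Because the position list [0, 1, 2, 3] and the target row have the same length, advanced indexing pairs them elementwise and returns one probability per position, namely the probability the model assigns to the correct next token at that step. Index tensors of different shapes would be broadcast against each other instead.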
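
To make the indexing step concrete, here is a small plain-Python sketch with made-up probabilities. It contrasts paired selection (position i matched with target i, which is what same-length index lists do) with a full grid of every position against every target. All values and names are illustrative.

```python
import math

# Toy probabilities for one sequence: 4 positions, vocabulary of 3 tokens.
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.5, 0.25, 0.25],
]
targets = [0, 1, 2, 0]  # correct next token at each position

# Paired selection: position i is matched with targets[i], giving 4 values.
paired = [probs[pos][tok] for pos, tok in zip(range(4), targets)]

# Grid selection: every position crossed with every target, giving 4 x 4 values.
grid = [[probs[pos][tok] for tok in targets] for pos in range(4)]

# The loss proxy is the mean negative log of the paired probabilities.
loss = -sum(math.log(p) for p in paired) / len(paired)

print(f"paired: {paired}")
print(f"grid shape: {len(grid)} x {len(grid[0])}")
print(f"loss proxy: {loss:.4f}")
```

The paired values are the diagonal of the grid, which is why only the paired form lines up with per-step targets.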

Inspecting logits and computing cross‑entropy

I compared the manual loss proxy with the standard cross‑entropy by flattening logits and targets into the expected shapes. The two values matched, as expected, because cross‑entropy is the mean negative log‑probability of the target tokens.

print(f"Logits shape: {logits.shape}")
print(f"Logits: {logits}")

print(f"Targets shape: {batch_target_tokens.shape}")
print(f"Targets: {batch_target_tokens}")

flat_logits = logits.flatten(0, 1)
flat_targets = batch_target_tokens.flatten()

print(f"Flattened Logits shape: {flat_logits.shape}")
print(f"Flattened Targets shape: {flat_targets.shape}")

loss_fn = torch.nn.functional.cross_entropy
loss = loss_fn(flat_logits, flat_targets)
print(f"Cross-entropy loss: {loss}")

Why flatten: Loss functions expect (N, C) logits and (N,) targets; treating every time step as a classification example is standard for language modeling.
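
The equivalence between the flattened cross‑entropy and the manual proxy can be checked without PyTorch. The sketch below uses made-up logits for a batch of 2 sequences, 3 positions, and a vocabulary of 4. The helper names are illustrative, and it assumes the default mean reduction.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_sum for v in row]


def cross_entropy(flat_logits, flat_targets):
    """Mean negative log-likelihood over N rows of C logits each."""
    losses = [-log_softmax(row)[t] for row, t in zip(flat_logits, flat_targets)]
    return sum(losses) / len(losses)


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


# Shape (B=2, T=3, C=4) logits and shape (B=2, T=3) targets.
logits = [
    [[2.0, 0.5, 0.1, -1.0], [0.0, 1.5, 0.2, 0.3], [0.1, 0.1, 3.0, 0.0]],
    [[1.0, 1.0, 1.0, 1.0], [-0.5, 0.0, 0.5, 2.0], [0.3, 2.2, 0.1, 0.4]],
]
targets = [[0, 1, 2], [3, 3, 1]]

# Flatten batch and time into one axis: (B*T, C) and (B*T,).
flat_logits = [row for seq in logits for row in seq]
flat_targets = [t for seq in targets for t in seq]

loss = cross_entropy(flat_logits, flat_targets)

# Manual proxy: softmax, pick the target probability, average the negative logs.
manual = -sum(
    math.log(softmax(row)[t]) for row, t in zip(flat_logits, flat_targets)
) / len(flat_targets)

print(f"cross-entropy: {loss:.6f}")
print(f"manual proxy:  {manual:.6f}")
```

Both routes compute the same quantity; the log-softmax form is simply the numerically safer way to get there.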

Loading the corpus and estimating tokens

I read the combined raw corpus from disk and reported character and token counts to estimate the training budget under GPT‑2 BPE.

data_file_path = "data/all_books.txt"
with open(data_file_path, 'r', encoding = 'utf-8') as books:
    text_data = books.read()

print(f"Total Characters: {len(text_data)}")
print(f"Total Tokens after encoding: {len(tokenizer.encode(text_data))}")

Reminder: GPT‑2 BPE uses a byte‑level vocabulary of 50,257 tokens, so token count is not the same as word count or character count.

Total Characters: 19849702. Total Tokens after encoding: 5611150
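
These two numbers are enough for a back-of-the-envelope training budget. The sketch below uses the reported counts together with the batch size (8), context length (512), and 90% training split used later in this post. It assumes non-overlapping windows, and since the split is done on characters rather than tokens, the per-epoch figures are approximations.

```python
total_chars = 19_849_702
total_tokens = 5_611_150
context_length = 512
batch_size = 8

chars_per_token = total_chars / total_tokens

# 90% of the corpus goes to training; integer math avoids float rounding.
approx_train_tokens = total_tokens * 9 // 10
tokens_per_step = batch_size * context_length
approx_steps_per_epoch = approx_train_tokens // tokens_per_step

print(f"Characters per token: {chars_per_token:.2f}")
print(f"Tokens per optimizer step: {tokens_per_step}")
print(f"Approximate training tokens per epoch: {approx_train_tokens}")
print(f"Approximate optimizer steps per epoch: {approx_steps_per_epoch}")
```

The exact number of batches depends on how the data loader windows the text, so the value printed by `len(training_dataloader)` is the authoritative one.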

Building training and validation DataLoaders

I split the text into train and validation subsets and constructed DataLoaders that yield (x, y) batches for next‑token prediction with a fixed context window.

training_ratio = 0.9
training_size = int(training_ratio * len(text_data))
training_dataset = text_data[:training_size]
validation_dataset = text_data[training_size:]

from modules.DataLoader import create_dataloader

training_dataloader = create_dataloader(
    training_dataset,
    max_length = SYDSGPT_CONFIG_345M['context_length'],
    step_size = SYDSGPT_CONFIG_345M['context_length'],
    batch_size = 8,
    shuffle = True,
    drop_last = True,
    num_workers = 0,
)

validation_dataloader = create_dataloader(
    validation_dataset,
    max_length = SYDSGPT_CONFIG_345M['context_length'],
    step_size = SYDSGPT_CONFIG_345M['context_length'],
    batch_size = 8,
    shuffle = True,
    drop_last = True,
    num_workers = 0,
)

print(f"Number of training batches: {len(training_dataloader)}")

print("Training loader:")
for x, y in training_dataloader:
    print(x.shape, y.shape)
    break

print(f"Number of validation batches: {len(validation_dataloader)}")

print("Validation loader:")
for x, y in validation_dataloader:
    print(x.shape, y.shape)
    break
  • Key parameters: max_length sets the window size; step_size (the stride) equals the context length, so consecutive windows do not overlap in this configuration
  • Batch shapes: x and y are LongTensors of shape (batch size, max length)
  • Tip: Adjust batch size or max length based on memory constraints
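
The `create_dataloader` module itself is not reproduced here. A minimal sketch of the windowing it presumably performs, where each target is the input shifted one token to the right, could look like the following. `make_windows` is a hypothetical helper, and the token IDs are stand-ins for `tokenizer.encode(text)`.

```python
def make_windows(token_ids, max_length, step_size):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    # Stop early enough that every target window has a full max_length tokens.
    for start in range(0, len(token_ids) - max_length, step_size):
        x = token_ids[start : start + max_length]
        y = token_ids[start + 1 : start + max_length + 1]
        pairs.append((x, y))
    return pairs


token_ids = list(range(10))  # stand-in for real token IDs
windows = make_windows(token_ids, max_length=4, step_size=4)
for x, y in windows:
    print(x, "->", y)
```

With `step_size` equal to `max_length`, each token appears in exactly one input window. A smaller `step_size` would produce overlapping windows and more training examples from the same text.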

Utilities for computing loss

I wrote two small helpers: one for per‑batch loss and one for averaging loss over a loader. These make evaluation and training loops concise and consistent.

def calc_batch_loss(input_batch, target_batch, model, device):
    input_batch = input_batch.to(device)
    target_batch = target_batch.to(device)
    logits = model(input_batch)
    loss = torch.nn.functional.cross_entropy(logits.flatten(0,1), target_batch.flatten())
    return loss


def calc_loader_loss(data_loader, model, device, num_batches = None):
    total_loss = 0
    if len(data_loader) == 0:
        return float('nan')
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    for batch_index, (input_batch, target_batch) in enumerate(data_loader):
        if batch_index >= num_batches:
            break
        else:
            batch_loss = calc_batch_loss(input_batch, target_batch, model, device)
            total_loss += batch_loss.item()
    return total_loss / num_batches

Averaging semantics: Each batch loss is a mean per token; the loader loss averages those batch means equally.
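
The difference between averaging batch means and averaging over tokens only shows up when batches differ in size. The sketch below uses made-up loss values to illustrate both cases.

```python
def mean_of_means(batches):
    """Average the per-batch mean losses, weighting every batch equally."""
    return sum(loss for loss, _ in batches) / len(batches)


def token_weighted_mean(batches):
    """Average over tokens, so larger batches count for more."""
    total_tokens = sum(n for _, n in batches)
    return sum(loss * n for loss, n in batches) / total_tokens


# Each entry: (mean loss of the batch, number of tokens in the batch).
equal_batches = [(4.0, 4096), (5.0, 4096)]
unequal_batches = [(4.0, 4096), (5.0, 4096), (8.0, 1024)]

print(mean_of_means(equal_batches), token_weighted_mean(equal_batches))
print(mean_of_means(unequal_batches), token_weighted_mean(unequal_batches))
```

Because the loaders here use `drop_last = True` with a fixed window length, every batch has the same number of tokens and the two averages coincide.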

Baseline training and validation losses

Before training, I computed baseline losses without autograd to sanity check the pipeline.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

model.to(device)
with torch.no_grad():
    training_loss = calc_loader_loss(training_dataloader, model, device)
    validation_loss = calc_loader_loss(validation_dataloader, model, device)

print(f"Initial Training Loss: {training_loss}")
print(f"Initial Validation Loss: {validation_loss}")

Expectation: With random initialization and a vocab of 50,257, an initial loss near ln(50257) ≈ 10.82 is typical.
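
That figure is the cross‑entropy of a model that spreads its probability uniformly over the vocabulary, which is a reasonable approximation of a freshly initialized network. It can be checked in two lines:

```python
import math

vocab_size = 50257
uniform_loss = math.log(vocab_size)   # cross-entropy of a uniform guess, in nats
perplexity = math.exp(uniform_loss)   # exponentiating recovers the vocabulary size

print(f"Expected initial loss: {uniform_loss:.2f}")
print(f"Equivalent perplexity: {perplexity:.0f}")
```

An initial loss far above this value usually points to a bug such as misaligned targets; a value well below it before any training suggests data leakage or a loaded checkpoint.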

Sample generation helper during training

To monitor qualitative progress, I added a utility that generates text from a start context at the end of each epoch.

def generate_sample_text(model, tokenizer, device, start_context):
    model.eval()
    context_size = SYDSGPT_CONFIG_345M['context_length']
    input_tokens = text_to_tokens(start_context, tokenizer).to(device)
    with torch.no_grad():
        generated_tokens = generate_simple(model, input_tokens, 100, context_size)
    generated_text = tokens_to_text(generated_tokens, tokenizer)
    print(f"Generated Text: {generated_text}".replace("\n", " "))
    model.train()

Training loop with periodic evaluation and checkpointing

I trained the model with AdamW, tracked tokens processed, evaluated losses periodically, saved autosave checkpoints, and printed sample generations after each epoch.

def train_model_v1(model, training_dataloader, validation_dataloader, optimizer, device, num_epochs, evaluation_frequency, evaluation_iterations, start_context, tokenizer, checkpoint_interval = 500):
    training_losses, validation_losses, total_tokens_processed = [], [], []
    tokens_processed = 0
    global_step = -1

    for epoch in range(num_epochs):
        model.train()
        for input_batch, target_batch in training_dataloader:
            optimizer.zero_grad()
            loss = calc_batch_loss(input_batch, target_batch, model, device)
            loss.backward()
            optimizer.step()
            tokens_processed += input_batch.numel()
            global_step += 1
            total_tokens_processed.append(tokens_processed)
            print(f"Epoch {epoch+1}, Step {global_step}: Tokens Processed = {tokens_processed}")

            if global_step % evaluation_frequency == 0:
                model.eval()
                with torch.no_grad():
                    training_loss = calc_loader_loss(training_dataloader, model, device, evaluation_iterations)
                    validation_loss = calc_loader_loss(validation_dataloader, model, device, evaluation_iterations)
                    training_losses.append(training_loss)
                    validation_losses.append(validation_loss)                    
                    print(f"Epoch {epoch+1}, Step {global_step}: Training Loss = {training_loss}, Validation Loss = {validation_loss}, Tokens Processed = {tokens_processed}")
                model.train()

            if global_step % checkpoint_interval == 0:
                torch.save({"model_state_dict": model.state_dict(), "optimizer_state_dict": optimizer.state_dict()}, "autosave_sydsgpt_345m_trained_model_optimizer.pth")

        
        generate_sample_text(model, tokenizer, device, start_context)

    return training_losses, validation_losses, total_tokens_processed
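
Because `global_step` starts at -1 and is incremented before the modulo checks, the first evaluation and the first autosave both happen at step 0. The sketch below replays just the counter logic to show which steps trigger each action. The 1200-step run length is a hypothetical value chosen for illustration.

```python
def schedule(num_steps, evaluation_frequency, checkpoint_interval):
    """Return the global steps that trigger evaluation and checkpointing."""
    eval_steps, checkpoint_steps = [], []
    global_step = -1
    for _ in range(num_steps):
        global_step += 1  # mirrors the increment after optimizer.step()
        if global_step % evaluation_frequency == 0:
            eval_steps.append(global_step)
        if global_step % checkpoint_interval == 0:
            checkpoint_steps.append(global_step)
    return eval_steps, checkpoint_steps


eval_steps, checkpoint_steps = schedule(
    1200, evaluation_frequency=100, checkpoint_interval=500
)
print(f"Evaluations at steps: {eval_steps}")
print(f"Checkpoints at steps: {checkpoint_steps}")
```

Note that the counter keeps running across epochs, so the schedule is defined in optimizer steps rather than per epoch.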

I ran the training as follows, initially for 5 epochs. It took me approx. 11 hours per epoch of training on my 3080 Ti GPU:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

torch.manual_seed(246)
model = SydsGPT(SYDSGPT_CONFIG_345M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr = 0.0002, weight_decay = 0.05)
num_epochs = 5
training_losses, validation_losses, total_tokens_processed = train_model_v1(
    model,
    training_dataloader,
    validation_dataloader,
    optimizer,
    device,
    num_epochs,
    evaluation_frequency = 100,
    evaluation_iterations = 2,
    start_context = "Once upon a time",
    tokenizer = tokenizer
)

Checkpointing: I saved a final checkpoint after training.

torch.save({"model_state_dict": model.state_dict(), "optimizer_state_dict": optimizer.state_dict()}, "sydsgpt_345m_trained_model_optimizer.pth")

Restoring from checkpoint and continuing training

I restored the model and optimizer states from checkpoint, relocated optimizer tensors to the correct device, and generated a sample to verify the restore. Then I continued training for a couple more epochs.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

model = SydsGPT(SYDSGPT_CONFIG_345M)
checkpoint = torch.load("sydsgpt_345m_trained_model_optimizer.pth", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = torch.optim.AdamW(model.parameters(), lr = 0.0002, weight_decay = 0.05)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.to(device)
model.to(device)
generate_sample_text(model, tokenizer, device, "once upon a time")

num_epochs = 2
training_losses, validation_losses, total_tokens_processed = train_model_v1(
    model,
    training_dataloader,
    validation_dataloader,
    optimizer,
    device,
    num_epochs,
    evaluation_frequency = 100,
    evaluation_iterations = 2,
    start_context = "Once upon a time",
    tokenizer = tokenizer
)

I also generated a longer sample to inspect coherence:

model.eval()
output_tokens = generate_simple(model, text_to_tokens("once upon a time", tokenizer).to(device), 200, SYDSGPT_CONFIG_345M['context_length'])
output_text = tokens_to_text(output_tokens, tokenizer)
print(f"Output Text: {output_text}")

Sample excerpt: The output included multi‑sentence, grammatically correct text with recurring narrative structures and dialogue markers.

Output Text: once upon a time before she was so busy, that I felt quite sure that I felt quite sure that I was not quite sure that I felt it was. “You are a child,” said I, “that you are a beautiful woman, and you are a beautiful woman.” “Yes,” said Ada, “that there is nothing in it.” “That is,” said my guardian, “that there is nothing else that makes it so.” “That is not,” said my guardian, “that there is such a time as you are.” “You are not to be removed,” said my guardian, “that there is no considerable answer.” “You are not to be always happy,” said my guardian, “that there is something of the kind

Exploring sampling strategies: probabilities, temperature, and top‑k

To understand sampling dynamics, I built a small illustrative example using a toy vocabulary and logits. I compared greedy selection with multinomial sampling and examined how temperature and top‑k filtering shape token distributions.

example_vocab = {
    "once" : 0,
    "upon" : 1,
    "a" : 2,
    "time" : 3,
    "before" : 4,
    "she" : 5,
    "lived" : 6,
    "happily" : 7,
    "ever" : 8,
    "after" : 9
}
inverse_example_vocab = {v: k for k, v in example_vocab.items()}

example_next_token_logits = torch.tensor([1.35, 1.86, 1.53, 0.17, 3.63, -1.82, -2.17, -3.90, -4.85, -5.38])
example_next_token_probs = torch.softmax(example_next_token_logits, dim = 0)
example_greedy_next_token = torch.argmax(example_next_token_probs).item()
print(f"Greedy Next Token: {inverse_example_vocab[example_greedy_next_token]}")

torch.manual_seed(246)
example_random_next_token = torch.multinomial(example_next_token_probs, num_samples = 1).item()
print(f"Random Next Token: {inverse_example_vocab[example_random_next_token]}")

def get_sampled_tokens(probs):
    sampled_token = [torch.multinomial(probs, num_samples = 1).item() for i in range(1000)]
    sampled_tokens = torch.bincount(torch.tensor(sampled_token))
    for i, frequency in enumerate(sampled_tokens):
        print(f"Token: {inverse_example_vocab[i]}: {frequency.item()} times")

get_sampled_tokens(example_next_token_probs)

Temperature scaling:

def softmax_with_temperature(logits, temperature):
    scaled_logits = logits / temperature
    probs = torch.softmax(scaled_logits, dim = 0)
    return probs

temperatures = [0.1, 0.5, 1.0, 2.0]
for temp in temperatures:
    temperature_scaled_probs = softmax_with_temperature(example_next_token_logits, temp)
    print(f"\n Temperature: {temp}")
    get_sampled_tokens(temperature_scaled_probs)

Top‑k filtering:

top_k = 4
top_k_logits, top_k_indices = torch.topk(example_next_token_logits, top_k)
print(f"Top-{top_k} Indices: {top_k_indices}")
print(f"Top-{top_k} Logits: {top_k_logits}")

new_logits = torch.where(
    condition = example_next_token_logits < top_k_logits[-1],
    input = torch.tensor(float('-inf')),
    other = example_next_token_logits
)

print(f"New Logits after Top-{top_k} filtering: {new_logits}")

top_k_probs = torch.softmax(new_logits, dim = 0)
print(f"Top-{top_k} Probabilities: {top_k_probs}")
get_sampled_tokens(top_k_probs)

Insight: Lower temperature sharpens distributions and favors high‑probability tokens; top‑k truncates the distribution to the k most likely tokens for more controlled sampling.
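
Sampling 1000 times gives noisy counts, so it can help to look at the exact probabilities as well. This plain-Python sketch reuses the toy logits from above and computes the distribution at each temperature, plus the renormalized distribution after top‑k filtering. The helper names are illustrative.

```python
import math

logits = [1.35, 1.86, 1.53, 0.17, 3.63, -1.82, -2.17, -3.90, -4.85, -5.38]


def softmax(values, temperature=1.0):
    """Softmax over temperature-scaled values; -inf entries get probability 0."""
    scaled = [v / temperature for v in values]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(values, k):
    """Keep the k largest values and replace the rest with -inf."""
    threshold = sorted(values, reverse=True)[k - 1]
    return [v if v >= threshold else float("-inf") for v in values]


temperatures = (0.1, 0.5, 1.0, 2.0)
top_probs = [max(softmax(logits, t)) for t in temperatures]
for t, p in zip(temperatures, top_probs):
    print(f"T={t}: probability of the most likely token = {p:.3f}")

top4 = softmax(top_k_filter(logits, 4))
kept = [i for i, p in enumerate(top4) if p > 0]
print(f"Tokens kept by top-4: {kept}")
```

As temperature rises, the most likely token loses probability mass to the others. Top‑k then caps how far that mass can spread by zeroing everything outside the k best candidates.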

A configurable generation function with temperature and top‑k

I implemented a general generation helper that supports temperature scaling, top‑k filtering, context truncation, and optional EOS termination.

def generate(model, input_tokens, max_new_tokens, context_size, temperature = 1.0, top_k = None, eos_id = None):
    for _ in range(max_new_tokens):
        input_context = input_tokens[:, -context_size:]
        with torch.no_grad():
            logits = model(input_context)
            logits = logits[:, -1, :]
            if top_k is not None:
                top_k_logits, _ = torch.topk(logits, top_k)
                min_top_k_logit = top_k_logits[:, -1:]  # shape (batch, 1) so the comparison broadcasts per row
                logits = torch.where(logits < min_top_k_logit, torch.tensor(float('-inf')).to(logits.device), logits)
            if temperature > 0.0:
                logits = logits / temperature
                probs = torch.softmax(logits, dim = -1)
                next_token = torch.multinomial(probs, num_samples = 1)
            else:
                next_token = torch.argmax(logits, dim = -1, keepdim = True)
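            # Stop only when an EOS id was provided and every sequence in the batch emitted it
            if eos_id is not None and (next_token == eos_id).all():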
                break
            input_tokens = torch.cat((input_tokens, next_token), dim = 1)
    return input_tokens

I used it to generate longer outputs with controlled sampling:

torch.manual_seed(246)
input_text = "once upon a time"
input_tokens = text_to_tokens(input_text, tokenizer).to(device)
output_tokens = generate(model, input_tokens, 200, SYDSGPT_CONFIG_345M['context_length'], temperature = 1.5, top_k = 40)
output_text = tokens_to_text(output_tokens, tokenizer)
print(f"Output Text:\n {output_text}")

Observation: With temperature = 1.5 and top‑k = 40, the output kept recognizable sentence structure while becoming far more varied; the higher temperature also produced occasional non‑words such as "unbension".

Output Text: once upon a time after his arrival. But again had that time too already settled in the possibility of talking in the existing histories given personal opportunity of reproachfulness? Hath it not not been brought together simply that one who had not always tried? And the most wonderful man, in a sort of unbension with which he had been capable of using the money by a man who had done so intimatelyision and must not think about as a politician be better in his physical conversation, for whose knowledge there must give a reference to the facts (a lady, especially on purpose, placed upright in their hands) of the unhappy man. The victim might receive her reason to be as much as a hypocrite as possible, but of having supposed she to do as, as it came upon him, as a mode of their being a woman. The latter part of his respect took place to him as much as much as possible to give it him

What I learned in Part 6

  • Pretraining basics: I built and validated a complete language modeling pipeline including data loading, tokenization, batching, loss computation, training, evaluation, sampling, and checkpointing.
  • Grammatically correct generation: After training on 20 books from Project Gutenberg, SydsGPT produced coherent, grammatical text with clear sentence structure and narrative elements.
  • Reproducibility: Fixed seeds, consistent device handling, and periodic checkpoints made the process repeatable and auditable.
  • Sampling behavior: Temperature and top‑k are powerful controls over style, diversity, and determinism during generation.

Try It Yourself

The full notebook with all the steps, from preparing the corpus and data loaders through loss computation and the pretraining loop to text sampling and generation, is available here:

SydsGPT pretraining Repository

Clone the repo, open the Jupyter notebook, and step through the code.

Build It Yourself

If you want to try building it yourself, you can find the complete code with detailed explanations of each block in the source code section at the end of this post. All the best!

What comes next

Part 7 will focus on training optimization techniques to push efficiency on limited hardware.

  • Mixed precision training: Reduce memory footprint and increase throughput by using bfloat16/float16 safely during training.
  • Gradient accumulation: Simulate larger batch sizes without exceeding memory limits.
  • Flash attention: Optimize attention computation for speed and memory efficiency.
  • KV cache: Speed up autoregressive generation by caching key/value tensors across steps.

The roadmap remains clear. I will train a small language model of around 200M parameters on a diverse corpus of approximately 3B tokens, then fine‑tune it on domain‑specific data. I will add tool calling for web search and build a RAG pipeline to interact with private data. The aim is a private assistant that respects privacy and delivers practical value, proving that small models can go far when engineered with care.

Source Code

pretraining
