In this part, I scaled the pretraining pipeline up to a ~10B-token corpus: pre-tokenization and chunking for streaming, a Flash Attention replacement inside the GPT blocks, training-loop features (warmup, cosine decay, gradient accumulation), torch.compile for runtime speedups, and GaLoreAdamW as the optimizer. I then ran a long single-GPU pretraining run (~12B tokens over ~11 days on an NVIDIA 3080 Ti). This post documents the full process, explains the design choices, and shows the exact code so readers can reproduce and adapt each step.

Overview

What this part covers

  • Dataset assembly and preprocessing: combining multiple corpora, pre-tokenization with tiktoken, and chunking into fixed-length shards stored as Parquet for streaming.
  • Model changes: replacing standard attention with a Flash Attention style implementation using torch.nn.functional.scaled_dot_product_attention, and wiring that into a new transformer block and model class.
  • Training loop improvements: LR warmup, cosine decay, gradient accumulation, gradient clipping, periodic evaluation, and rotating checkpoints.
  • Performance engineering: torch.compile usage and runtime flags, mixed-precision considerations, and optimizer selection (GaLoreAdamW).
  • Run summary and practical lessons from training ~12B tokens on a single 3080 Ti.

Below I walk through each stage and include the notebook code so you can see exactly what was done.

Dataset assembly and tokenization

Goal: build a large, mixed corpus and convert it into tokenized, fixed-length chunks that can be streamed efficiently during training.

Key ideas

  • Keep raw text columns minimal to save space.
  • Pre-tokenize with tiktoken (GPT-2 encoding) to get deterministic token counts.
  • Stream token lists into a buffer and emit fixed-size chunks (here CHUNK_SIZE = 2048) into Parquet shards for efficient, memory-mapped reads.

Code: dataset loading, trimming, concatenation, and saving

from datasets import load_dataset, load_from_disk, concatenate_datasets

fineweb_dataset = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train")
wikipedia_dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
arxiv_dataset = load_dataset("timaeus/pile-arxiv", split="train")

fineweb_dataset = fineweb_dataset.remove_columns([col for col in fineweb_dataset.column_names if col != 'text'])
wikipedia_dataset = wikipedia_dataset.remove_columns([col for col in wikipedia_dataset.column_names if col != 'text'])
arxiv_dataset = arxiv_dataset.remove_columns([col for col in arxiv_dataset.column_names if col != 'text'])

fineweb_dataset.save_to_disk('data/fineweb_dataset')
wikipedia_dataset.save_to_disk('data/wikipedia_dataset')
arxiv_dataset.save_to_disk('data/arxiv_dataset')

fineweb_dataset = load_from_disk('data/fineweb_dataset')
wikipedia_dataset = load_from_disk('data/wikipedia_dataset')
arxiv_dataset = load_from_disk('data/arxiv_dataset')

fineweb_dataset = fineweb_dataset.shuffle(seed=42)
wikipedia_dataset = wikipedia_dataset.shuffle(seed=42)
arxiv_dataset = arxiv_dataset.shuffle(seed=42)
# Keep the first 50% of each shuffled dataset (drops half the rows)
fineweb_dataset = fineweb_dataset.select(range(len(fineweb_dataset)//2))
wikipedia_dataset = wikipedia_dataset.select(range(len(wikipedia_dataset)//2))
arxiv_dataset = arxiv_dataset.select(range(len(arxiv_dataset)//2))

# Concatenate the datasets
combined_dataset = concatenate_datasets([fineweb_dataset, wikipedia_dataset, arxiv_dataset])


combined_dataset.save_to_disk('data/combined_dataset')

Code: tokenization with tiktoken and saving tokenized dataset

import tiktoken
enc = tiktoken.get_encoding("gpt2")

from typing import Optional, Tuple
from datasets import Dataset

# Tokenize a HF Dataset and save to disk. Returns (tokenized_dataset, total_tokens)
def tokenize_and_save(
    dataset: Dataset,
    out_dir: str,
    text_column: str = 'text',
    keep_text: bool = False,
    batch_size: int = 1000,
    num_proc: Optional[int] = None,
) -> Tuple[Dataset, int]:
    """
    - Tokenizes each row's text using the global `enc` (tiktoken GPT-2).
    - Adds 'input_ids' (List[int]) and 'length' (int) columns.
    - Optionally removes the original 'text' column to save space.
    - Saves the resulting dataset to `out_dir`.
    Returns the tokenized dataset and the total token count.
    """

    def tok_batch(batch):
        texts = batch[text_column]
        input_ids = [enc.encode(t, allowed_special={'<|endoftext|>'}) for t in texts]
        lengths = [len(ids) for ids in input_ids]
        return {'input_ids': input_ids, 'length': lengths}

    remove_cols = None if keep_text else [text_column]

    tokenized = dataset.map(
        tok_batch,
        batched=True,
        batch_size=batch_size,
        num_proc=num_proc,
        remove_columns=remove_cols,
        desc=f"Tokenizing -> {out_dir}",
    )

    # Compute total tokens efficiently by summing the 'length' column
    total_tokens = int(sum(tokenized['length']))

    # Persist to disk
    tokenized.save_to_disk(out_dir)

    return tokenized, total_tokens

dataset_path = 'data/combined_dataset'
tokenized_dataset_path = 'data/combined_tokenized_dataset'

# Load datasets from disk
dataset = load_from_disk(dataset_path)

dataset_tokenized, dataset_token_count = tokenize_and_save(dataset, tokenized_dataset_path, keep_text=False, batch_size=1000, num_proc=None)


print('Tokenized dataset sizes (rows):', {
    'combined_rows': len(dataset_tokenized),
})
print('Total token count:', {
    'combined_tokens': dataset_token_count,
})

Why this matters

  • Pre-tokenization gives you an exact token count and lets you reason about how many chunks and epochs you can run.
  • Saving tokenized rows to disk avoids repeated tokenization during experiments and makes preprocessing reproducible.
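For example, the token count printed above translates directly into a training budget. A rough back-of-the-envelope sketch (the chunk size, split ratio, and batch settings match values used later in this post, so treat them as assumptions at this point):

# Rough planning: turn the total token count into chunk, batch, and step estimates.
CHUNK_SIZE = 2048          # fixed chunk length used in the next section
TRAIN_PROB = 0.8           # approximate train share of chunks
BATCH_SIZE = 2             # per-iteration batch size used later
GRAD_ACCUM_STEPS = 64      # gradient accumulation factor used later

total_chunks = dataset_token_count // CHUNK_SIZE
train_chunks = int(total_chunks * TRAIN_PROB)
batches_per_epoch = train_chunks // BATCH_SIZE
optimizer_steps_per_epoch = batches_per_epoch // GRAD_ACCUM_STEPS

print({
    'total_chunks': total_chunks,
    'train_chunks': train_chunks,
    'batches_per_epoch': batches_per_epoch,
    'optimizer_steps_per_epoch': optimizer_steps_per_epoch,
})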

Chunking into fixed-length shards for streaming

Goal: convert variable-length token lists into fixed-length chunks (2048 tokens) and write them into Parquet shards for efficient streaming and reproducible sampling.

Design choices

  • Buffering: accumulate tokens across rows until you can emit a full chunk.
  • Shard sizing: choose a shard size (SHARD_SIZE_CHUNKS) that balances file count and I/O throughput.
  • Train/val split: random assignment at chunk emission time to get an approximate 80/20 split.
  • Parquet: memory-mapped reads via HF Dataset.from_parquet avoid loading everything into RAM.

Code: chunking pipeline and DataLoader wrappers

import os
import random
from pathlib import Path
from glob import glob
import pyarrow as pa
import pyarrow.parquet as pq
import torch
from torch.utils.data import Dataset as TorchDataset, DataLoader
from datasets import load_from_disk, Dataset

# Config
SOURCE_PATH = 'data/combined_tokenized_dataset'   # variable-length input_ids
OUT_TRAIN_DIR = Path('data/combined_chunks_train_parquet')
OUT_VAL_DIR = Path('data/combined_chunks_val_parquet')
CHUNK_SIZE = 2048
TRAIN_PROB = 0.8          # approximate 80/20 split at chunk level
SHARD_SIZE_CHUNKS = 25000  # number of chunks per parquet shard (tune for memory/disk throughput)
BATCH_SIZE = 2
SEED = 42

random.seed(SEED)
OUT_TRAIN_DIR.mkdir(parents=True, exist_ok=True)
OUT_VAL_DIR.mkdir(parents=True, exist_ok=True)

# Optional: clean old shards (only .parquet files)
for f in list(OUT_TRAIN_DIR.glob('*.parquet')) + list(OUT_VAL_DIR.glob('*.parquet')):
    try:
        f.unlink()
    except Exception:
        pass

# Helper to write a shard of chunks to Parquet
# chunks: List[List[int]] (all must be CHUNK_SIZE long)
def write_parquet_shard(chunks, out_dir: Path, shard_idx: int):
    if not chunks:
        return
    array = pa.array(chunks, type=pa.list_(pa.int32()))
    table = pa.table({'input_ids': array})
    pq.write_table(table, out_dir / f'part-{shard_idx:05d}.parquet')

# Stream over dataset and produce fixed-size chunks
buf = []  # token buffer
train_batch, val_batch = [], []
train_shard, val_shard = 0, 0
train_count, val_count = 0, 0

src = load_from_disk(SOURCE_PATH)
print('Streaming rows:', len(src))

for row in src:
    toks = row['input_ids']
    if not toks:
        continue
    buf.extend(toks)
    while len(buf) >= CHUNK_SIZE:
        chunk = buf[:CHUNK_SIZE]
        del buf[:CHUNK_SIZE]
        if random.random() < TRAIN_PROB:
            train_batch.append(chunk)
            train_count += 1
            if len(train_batch) >= SHARD_SIZE_CHUNKS:
                write_parquet_shard(train_batch, OUT_TRAIN_DIR, train_shard)
                train_shard += 1
                train_batch = []
        else:
            val_batch.append(chunk)
            val_count += 1
            if len(val_batch) >= SHARD_SIZE_CHUNKS:
                write_parquet_shard(val_batch, OUT_VAL_DIR, val_shard)
                val_shard += 1
                val_batch = []

# Flush leftovers
write_parquet_shard(train_batch, OUT_TRAIN_DIR, train_shard)
write_parquet_shard(val_batch, OUT_VAL_DIR, val_shard)

print({'train_chunks_written': train_count, 'val_chunks_written': val_count, 'leftover_tokens_dropped': len(buf)})

# Build HF Datasets from Parquet shards (memory-mapped; avoids loading everything at once)
train_parquet_files = sorted(glob(str(OUT_TRAIN_DIR / '*.parquet')))
val_parquet_files = sorted(glob(str(OUT_VAL_DIR / '*.parquet')))

train_hfds = Dataset.from_parquet(train_parquet_files)
val_hfds = Dataset.from_parquet(val_parquet_files)

print({'train_rows': len(train_hfds), 'val_rows': len(val_hfds)})

# Torch wrappers and DataLoaders (no attention_mask), with external shift in collate
class FixedLenHFDataset(TorchDataset):
    def __init__(self, hf_ds: Dataset):
        self.ds = hf_ds
    def __len__(self):
        return len(self.ds)
    def __getitem__(self, idx):
        ids = self.ds[idx]['input_ids']
        return torch.tensor(ids, dtype=torch.long)

def collate_shift(batch):
    x = torch.stack(batch)      # (B, CHUNK_SIZE)
    y = x.clone()
    y[:, :-1] = x[:, 1:]
    y[:, -1] = -100
    return {'input_ids': x, 'targets': y}

train_fixed = FixedLenHFDataset(train_hfds)
val_fixed = FixedLenHFDataset(val_hfds)

training_dataloader = DataLoader(train_fixed, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_shift)
validation_dataloader = DataLoader(val_fixed, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_shift)

Notes on the collate function

  • The collate_shift function prepares input_ids and targets by shifting tokens left for next-token prediction and using -100 as the ignore index for the final token. This keeps the loss computation simple and efficient.
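A tiny sanity check makes the shift concrete (illustrative only; real chunks are CHUNK_SIZE tokens long):

import torch

toy_batch = [torch.tensor([10, 11, 12, 13]), torch.tensor([20, 21, 22, 23])]
out = collate_shift(toy_batch)
print(out['input_ids'])  # tensor([[10, 11, 12, 13], [20, 21, 22, 23]])
print(out['targets'])    # tensor([[11, 12, 13, -100], [21, 22, 23, -100]])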

Model architecture and Flash Attention

Goal: reduce attention memory pressure and improve throughput by using torch.nn.functional.scaled_dot_product_attention (Flash Attention style) while preserving causal masking.

Key points

  • The FlashAttention class computes Q/K/V via a single linear, reshapes into heads, and calls scaled_dot_product_attention with is_causal=True.
  • The transformer block (TransformerBlockv2) uses LayerNorm, the FlashAttention module, and a FeedForward module.
  • The top-level model SydsGPTv2 wires token and position embeddings, a stack of TransformerBlockv2, final layer norm, and an output projection.

Code: FlashAttention, TransformerBlockv2, and SydsGPTv2

import torch
import torch.nn as nn

class FlashAttention(nn.Module):
    def __init__(self, embedding_dim, num_heads, dropout=0.1):
        super().__init__()
        assert embedding_dim % num_heads == 0, "embedding_dim must be divisible by num_heads"
        self.embedding_dim = embedding_dim
        self.num_heads = num_heads
        self.head_dim = embedding_dim // num_heads
        self.dropout = dropout

        self.qkv = nn.Linear(embedding_dim, 3 * embedding_dim)
        self.out_proj = nn.Linear(embedding_dim, embedding_dim)

    def forward(self, x):
        batch_size, seq_length, _ = x.shape
        qkv = self.qkv(x)
        qkv = qkv.view(batch_size, seq_length, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        queries, keys, values = qkv
        dropout = 0.0 if not self.training else self.dropout
        context_vectors = torch.nn.functional.scaled_dot_product_attention(queries, keys, values, attn_mask = None, dropout_p = dropout, is_causal = True)
        context_vectors = context_vectors.transpose(1, 2).contiguous().view(batch_size, seq_length, self.embedding_dim)
        context_vectors = self.out_proj(context_vectors)
        return context_vectors

from modules.LayerNorm import LayerNorm
from modules.FeedForward import FeedForward
import torch.nn as nn

class TransformerBlockv2(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.attention = FlashAttention(
            embedding_dim = config["embedding_dim"],
            num_heads = config["num_heads"],
            dropout = config["dropout"],
        )
        self.layer_norm1 = LayerNorm(config["embedding_dim"])
        self.feed_forward = FeedForward(config)
        self.layer_norm2 = LayerNorm(config["embedding_dim"])
        self.dropout = nn.Dropout(config["dropout"])

    def forward(self, x):
        shortcut = x
        x = self.layer_norm1(x)
        x = self.attention(x)
        x = self.dropout(x)
        x = x + shortcut
        shortcut = x
        x = self.layer_norm2(x)
        x = self.feed_forward(x)
        x = self.dropout(x)
        x = x + shortcut
        return x

import torch
import torch.nn as nn
from modules.LayerNorm import LayerNorm

class SydsGPTv2(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embedding = nn.Embedding(config["vocab_size"], config["embedding_dim"])
        self.position_embedding = nn.Embedding(config["context_length"], config["embedding_dim"])
        self.drop_embedding = nn.Dropout(config["dropout"])
        self.transformer_blocks = nn.Sequential(*[TransformerBlockv2(config) for _ in range(config["num_layers"])])
        self.final_layer_norm = LayerNorm(config["embedding_dim"])
        self.output_projection = nn.Linear(config["embedding_dim"], config["vocab_size"], bias = False)
    
    def forward(self, input):
        batch_size, seq_length = input.shape
        token_embeddings = self.token_embedding(input)
        position_embeddings = self.position_embedding(torch.arange(seq_length, device=input.device))
        x = token_embeddings + position_embeddings
        x = self.drop_embedding(x)
        x = self.transformer_blocks(x)
        x = self.final_layer_norm(x)
        logits = self.output_projection(x)
        return logits

Practical validation

  • Compare logits on a small batch between the FlashAttention model and a baseline to ensure numerical parity within tolerance.
  • Confirm is_causal=True to preserve autoregressive behavior.
  • Watch dtype: scaled_dot_product_attention supports mixed precision; ensure your autocast and torch.set_float32_matmul_precision settings align with your hardware.
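As a minimal sketch of such a parity check, one can compare scaled_dot_product_attention against a plain eager-mode implementation of causal attention (this checks the kernel itself rather than the full model; the reference function below is written for illustration):

import torch
import torch.nn.functional as F

def reference_causal_attention(q, k, v):
    # Plain eager-mode causal attention for comparison.
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale
    seq_len = q.shape[-2]
    causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(causal_mask, float('-inf'))
    return torch.softmax(scores, dim=-1) @ v

q, k, v = (torch.randn(2, 4, 16, 32) for _ in range(3))  # (batch, heads, seq, head_dim)
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)
ref = reference_causal_attention(q, k, v)
print(torch.allclose(fused, ref, atol=1e-5))  # expect True within tolerance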

Compilation and runtime flags

Goal: reduce Python overhead and fuse kernels where possible using torch.compile, and enable safe TF32/precision knobs on Ampere+ GPUs.

Code: performance flags and torch.compile

# Compile model with torch.compile and set performance flags
import torch
import contextlib

# Optional performance knobs (safe on Ampere+ GPUs; harmless on CPU)
try:
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    torch.backends.cudnn.benchmark = True
except Exception:
    pass

# Allow faster, slightly lower-precision float32 matmuls (TF32-backed) where supported
with contextlib.suppress(Exception):
    torch.set_float32_matmul_precision('high')  # 'high' or 'medium'

# Choose a compile configuration
compile_backend = 'inductor'          # default backend
compile_mode = 'default'              # try 'reduce-overhead' or 'max-autotune' later
dynamic_shapes = False                # set True if you plan to change batch size frequently
compile_ok = False


try:
    model = torch.compile(model, backend=compile_backend, mode=compile_mode, dynamic=dynamic_shapes)
    compile_ok = True
    print(f"Model compiled with torch.compile (backend={compile_backend}, mode={compile_mode}, dynamic={dynamic_shapes})")
    print("Note: First iteration includes compile time; subsequent steps are faster.")
except Exception as e:
    print("torch.compile failed; falling back to eager. Error:\n", e)

Notes

  • The first iteration after torch.compile includes compilation overhead; measure steady-state throughput after warmup.
  • torch.backends.cudnn.benchmark = True helps when input sizes are stable.
  • torch.set_float32_matmul_precision('high') can improve matmul performance on supported hardware.
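A simple way to separate the one-time compile cost from steady-state throughput is to time a few batches twice (a sketch; it assumes model, training_dataloader, and device from earlier cells and only times forward passes):

import time
import itertools
import torch

def measure_tokens_per_sec(model, dataloader, device, num_steps):
    # Forward-only timing; enough to see the compile-then-fast pattern.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    tokens = 0
    with torch.no_grad():
        for batch in itertools.islice(iter(dataloader), num_steps):
            x = batch['input_ids'].to(device)
            model(x)
            tokens += x.numel()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return tokens / (time.perf_counter() - start)

# The first call pays the one-time compilation cost; the second reflects steady state.
print('first-pass tokens/sec:  ', measure_tokens_per_sec(model, training_dataloader, device, num_steps=3))
print('steady-state tokens/sec:', measure_tokens_per_sec(model, training_dataloader, device, num_steps=20))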

Training loop: warmup, cosine decay, gradient accumulation, and checkpoints

Goals

  • Stabilize early training with LR warmup.
  • Use cosine decay to anneal LR smoothly across the full training horizon.
  • Use gradient accumulation to simulate large effective batch sizes on a single GPU.
  • Rotate checkpoints to limit disk usage while keeping recent history.

Hyperparameters used in the run

  • initial_lr = 1e-6, peak_lr = 1e-4, min_lr = 0.1 * peak_lr.
  • Warmup set to ~2% of steps per epoch.
  • grad_accum_steps = 64 to scale effective batch size.
  • checkpoint_interval = 10000 steps (rotating saves).
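For reference, the scheduler arguments passed to the training calls later in the post can be derived from these numbers roughly as follows (a sketch; total_steps_per_epoch is taken as the DataLoader length, and the 2% warmup fraction follows the bullet above):

initial_lr = 1e-6
peak_lr = 1e-4
min_lr = 0.1 * peak_lr

# One "step" here is one DataLoader iteration (see the note in train_model_v3).
total_steps_per_epoch = len(training_dataloader)
warmup_steps = int(0.02 * total_steps_per_epoch)  # ~2% of steps per epoch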

Code: training function v2 (basic warmup + cosine decay)

import math
import os
from modules.Loss import calc_batch_loss
from modules.Generate import generate_sample_text

def train_model_v2(model, training_dataloader, validation_dataloader, optimizer, device,
                   num_epochs, evaluation_frequency, start_context,
                   tokenizer, checkpoint_interval, total_steps_per_epoch, warmup_steps, initial_lr, peak_lr, min_lr):
    
    training_losses, validation_losses, total_tokens_processed, learning_rates = [], [], [], []
    total_tokens_processed, global_step = 0, -1
    total_training_steps = num_epochs * total_steps_per_epoch
    lr_increment = (peak_lr - initial_lr) / warmup_steps

    for epoch in range(num_epochs):
        model.train()
        for batch in training_dataloader:
            optimizer.zero_grad()
            global_step += 1
            if global_step < warmup_steps:
                lr = initial_lr + global_step * lr_increment
            else:
                progress = (global_step - warmup_steps) / (total_training_steps - warmup_steps)
                lr = min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr
            learning_rates.append(lr)

            loss = calc_batch_loss(batch['input_ids'], batch['targets'], model, device)
            loss.backward()

            if global_step >= warmup_steps:
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm = 1.0)

            optimizer.step()
            training_losses.append(loss.item())
            total_tokens_processed += (batch['input_ids'] != -100).sum().item()
            print(f"Epoch {epoch + 1}, Step {global_step}: Tokens Processed = {total_tokens_processed}, Training Loss = {loss.item()}")

            if global_step >= evaluation_frequency and global_step % evaluation_frequency == 0:
                model.eval()
                val_batch = next(iter(validation_dataloader))
                with torch.no_grad():
                    val_loss = calc_batch_loss(val_batch['input_ids'], val_batch['targets'], model, device)
                validation_losses.append(val_loss.item())
                print(f"--- Evaluation at Epoch {epoch + 1}, Step {global_step}: Validation Loss = {val_loss.item()} ---")
                generate_sample_text(model, tokenizer, device, start_context)
                model.train()
            
            if global_step >= checkpoint_interval and global_step % checkpoint_interval == 0:
                base_ckpt = "autosave_ckpt1_sydsgpt_v2_164m_trained_model_optimizer.pth"
                prev1_ckpt = "autosave_ckpt1_prev1_sydsgpt_v2_164m_trained_model_optimizer.pth"

                try:
                    if os.path.exists(prev1_ckpt):
                        os.remove(prev1_ckpt)
                except Exception:
                    pass

                try:
                    if os.path.exists(base_ckpt):
                        os.replace(base_ckpt, prev1_ckpt)
                except Exception:
                    pass

                torch.save({"model_state_dict": model.state_dict(), "optimizer_state_dict": optimizer.state_dict()}, base_ckpt)
                print(f"Checkpoint saved (rotating): {base_ckpt} | prev1 -> {prev1_ckpt}")


    return training_losses, validation_losses, total_tokens_processed, learning_rates

Code: optimizer instantiation

from galore_torch import GaLoreAdamW
optimizer = GaLoreAdamW(model.parameters(), weight_decay=0.05)

Code: training function v3 (warmup + cosine + gradient accumulation + rotated checkpoints)

import math
import os
from modules.Loss import calc_batch_loss
from modules.Generate import generate_sample_text

def train_model_v3(model, training_dataloader, validation_dataloader, optimizer, device,
                   num_epochs, evaluation_frequency, start_context,
                   tokenizer, checkpoint_interval, total_steps_per_epoch, warmup_steps, initial_lr, peak_lr, min_lr,
                   grad_accum_steps: int = 1):
    """
    Train with cosine decay + warmup and optional gradient accumulation.

    Notes:
    - LR/warmup here are updated per batch (DataLoader iteration). If you want warmup
      in optimizer steps, compute warmup_steps accordingly (divide by grad_accum_steps).
    - loss is scaled by 1/grad_accum_steps before backward to keep gradients invariant.
    """
    training_losses, validation_losses, total_tokens_processed, learning_rates = [], [], [], []
    total_tokens_processed, global_step = 0, -1
    total_training_steps = num_epochs * total_steps_per_epoch
    lr_increment = (peak_lr - initial_lr) / max(1, warmup_steps)
    accum_counter = 0
    
    optimizer.zero_grad(set_to_none=True)
    
    for epoch in range(num_epochs):
        model.train()
        for batch in training_dataloader:
            global_step += 1
            # Learning rate schedule per batch step
            if global_step < warmup_steps:
                lr = initial_lr + global_step * lr_increment
            else:
                progress = (global_step - warmup_steps) / max(1, (total_training_steps - warmup_steps))
                lr = min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
            for pg in optimizer.param_groups:
                pg['lr'] = lr
            learning_rates.append(lr)
            
            # Forward + backward (accumulated)
            loss = calc_batch_loss(batch['input_ids'], batch['targets'], model, device)
            training_losses.append(loss.item())  # log unscaled loss
            (loss / max(1, grad_accum_steps)).backward()
            accum_counter += 1
            
            # Token accounting (per batch)
            total_tokens_processed += (batch['input_ids'] != -100).sum().item()
            
            did_optimizer_step = False
            if accum_counter % max(1, grad_accum_steps) == 0:
                if global_step >= warmup_steps:
                    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
                optimizer.step()
                optimizer.zero_grad(set_to_none=True)
                did_optimizer_step = True
            
            print(f"Epoch {epoch + 1}, Step {global_step} ({'opt-step' if did_optimizer_step else 'accumulating'}): Tokens Processed = {total_tokens_processed}, Training Loss = {loss.item():.4f}, LR = {lr:.2e}")
            
            # Periodic evaluation
            if global_step >= evaluation_frequency and global_step % evaluation_frequency == 0:
                model.eval()
                try:
                    val_batch = next(iter(validation_dataloader))
                    with torch.no_grad():
                        val_loss = calc_batch_loss(val_batch['input_ids'], val_batch['targets'], model, device)
                    validation_losses.append(val_loss.item())
                    print(f"--- Evaluation at Epoch {epoch + 1}, Step {global_step}: Validation Loss = {val_loss.item():.4f} ---")
                    generate_sample_text(model, tokenizer, device, start_context)
                except StopIteration:
                    print("Validation loader empty; skipping eval.")
                finally:
                    model.train()
            
            # Checkpoint rotation (keep last 2)
            if global_step >= checkpoint_interval and global_step % checkpoint_interval == 0:
                base_ckpt = "autosave_ckpt1_sydsgpt_v2_164m_trained_model_optimizer.pth"
                prev1_ckpt = "autosave_ckpt1_prev1_sydsgpt_v2_164m_trained_model_optimizer.pth"
                prev2_ckpt = "autosave_ckpt1_prev2_sydsgpt_v2_164m_trained_model_optimizer.pth"
                try:
                    if os.path.exists(prev2_ckpt):
                        os.remove(prev2_ckpt)
                except Exception:
                    pass
                try:
                    if os.path.exists(prev1_ckpt):
                        os.replace(prev1_ckpt, prev2_ckpt)
                except Exception:
                    pass
                try:
                    if os.path.exists(base_ckpt):
                        os.replace(base_ckpt, prev1_ckpt)
                except Exception:
                    pass
                torch.save({"model_state_dict": model.state_dict(), "optimizer_state_dict": optimizer.state_dict()}, base_ckpt)
                print(f"Checkpoint saved (rotating): {base_ckpt} | prev1 -> {prev1_ckpt} | prev2 -> {prev2_ckpt}")
        
        # Flush leftover grads at epoch end (if any)
        if accum_counter % max(1, grad_accum_steps) != 0:
            if global_step >= warmup_steps:
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
            print("Flushed leftover accumulated gradients at epoch end.")
    
    return training_losses, validation_losses, total_tokens_processed, learning_rates

Practical tips

  • Loss scaling: dividing the loss by grad_accum_steps before .backward() keeps gradient magnitudes consistent with larger batch training.
  • Gradient clipping: apply only at optimizer step time to avoid clipping partial gradients repeatedly.
  • Checkpoint rotation: keeps disk usage bounded while preserving recent history for recovery.
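A quick, self-contained check of the loss-scaling claim, using a toy linear model rather than the training loop above: accumulating loss / k over k micro-batches reproduces the gradients of a single pass over the full batch.

import torch
import torch.nn as nn

torch.manual_seed(0)
big_batch_model = nn.Linear(4, 1)
accum_model = nn.Linear(4, 1)
accum_model.load_state_dict(big_batch_model.state_dict())

x = torch.randn(8, 4)
y = torch.randn(8, 1)
loss_fn = nn.MSELoss()

# One pass over the full batch
loss_fn(big_batch_model(x), y).backward()

# k accumulated micro-batches, each loss scaled by 1/k
k = 2
for xb, yb in zip(x.chunk(k), y.chunk(k)):
    (loss_fn(accum_model(xb), yb) / k).backward()

print(torch.allclose(big_batch_model.weight.grad, accum_model.weight.grad, atol=1e-6))  # True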

Running the experiment and saving final weights

  • Instantiated SydsGPTv2 with a 164M-parameter configuration and compiled it with torch.compile where possible.
  • Used GaLoreAdamW with weight_decay=0.05.
  • Ran train_model_v3 with grad_accum_steps = 64 and saved the final model as "sydsgpt_v2_164m_trained_model-11.8B.pth".

Code: training invocation and final save

from galore_torch import GaLoreAdamW
optimizer = GaLoreAdamW(model.parameters(), weight_decay=0.05)


num_epochs = 2
training_losses, validation_losses, total_tokens_processed, learning_rates = train_model_v2(
    model,
    training_dataloader,
    validation_dataloader,
    optimizer,
    device,
    num_epochs,
    evaluation_frequency = 10000,
    start_context = "Once upon a time",
    tokenizer = enc,
    checkpoint_interval = 10000,
    total_steps_per_epoch = total_steps_per_epoch,
    warmup_steps = warmup_steps,
    initial_lr = initial_lr,
    peak_lr = peak_lr,
    min_lr = min_lr
)
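
# Reload the rotating autosave checkpoint and continue training with train_model_v3 (gradient accumulation)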
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

model = SydsGPTv2(SYDSGPT_CONFIG_V2_164M)
checkpoint = torch.load("autosave_ckpt1_sydsgpt_v2_164m_trained_model_optimizer.pth", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = GaLoreAdamW(model.parameters(), weight_decay=0.05)
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.to(device)
model.to(device)
num_epochs = 1
grad_accum_steps = 64  # effective batch = BATCH_SIZE * grad_accum_steps
training_losses, validation_losses, total_tokens_processed, learning_rates = train_model_v3(
    model,
    training_dataloader,
    validation_dataloader,
    optimizer,
    device,
    num_epochs,
    evaluation_frequency = 10000,
    start_context = "Once upon a time",
    tokenizer = enc,
    checkpoint_interval = 10000,
    total_steps_per_epoch = total_steps_per_epoch,
    warmup_steps = warmup_steps,
    initial_lr = initial_lr,
    peak_lr = peak_lr,
    min_lr = min_lr,
    grad_accum_steps = grad_accum_steps
)

torch.save(model.state_dict(), "sydsgpt_v2_164m_trained_model-11.8B.pth")

Notes

  • The notebook shows both train_model_v2 and train_model_v3 being used; train_model_v3 is the final training function with gradient accumulation.
  • grad_accum_steps = 64 with BATCH_SIZE = 2 yields an effective batch size of 128, which is a practical way to approximate larger-batch training on a single GPU.

Loading and generation

Code: loading the final checkpoint and generating text

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

model = SydsGPTv2(SYDSGPT_CONFIG_V2_164M)
model.load_state_dict(torch.load("sydsgpt_v2_164m_trained_model-11.8B.pth", map_location=device))
model.to(device)

from modules.Generate import generate, text_to_tokens, tokens_to_text

import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")


input_text = "A deep neural network is a type of artificial neural network with multiple layers between the input and output layers, which allows it to learn hierarchical patterns in data."
input_tokens = text_to_tokens(input_text, tokenizer).to(device)
output_tokens = generate(model, input_tokens, 1000, SYDSGPT_CONFIG_V2_164M['context_length'], temperature = 1.5, top_k = 40)
output_text = tokens_to_text(output_tokens, tokenizer)
print(f"Output Text:\n {output_text}")

What to watch during generation

  • Temperature and top-k: higher temperature and top-k produce more diverse outputs but can also increase incoherence.
  • Context length: ensure the input fits within context_length or is truncated appropriately.
  • Token-to-text mapping: use the same tiktoken encoder used during training to avoid tokenization mismatches.
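For reference, the usual temperature plus top-k sampling step looks roughly like this (a sketch of the standard technique; the repo's generate function may differ in detail):

import torch

def sample_next_token(logits, temperature=1.5, top_k=40):
    # logits: (vocab_size,) for the last position of the sequence
    scaled = logits / temperature
    top_values, top_indices = torch.topk(scaled, top_k)
    probs = torch.softmax(top_values, dim=-1)  # renormalize over the top-k candidates only
    choice = torch.multinomial(probs, num_samples=1)
    return top_indices[choice]

# Example with random logits over the GPT-2 vocabulary size
print(sample_next_token(torch.randn(50257)).item())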

Observations from the run

Throughput and runtime

  • Running ~12B tokens on a single 3080 Ti required careful memory management: Flash Attention, gradient accumulation, and mixed-precision-friendly flags were essential.
  • torch.compile can reduce Python overhead and improve steady-state throughput, but the first iteration includes compilation time. Measure both compile time and steady-state tokens/sec.
  • Parquet shards and memory-mapped HF datasets kept RAM usage low and allowed streaming large corpora without loading everything into memory.

Stability

  • LR warmup prevented early divergence. A small initial_lr and a short warmup window (2% of steps per epoch) stabilized the first phase.
  • Cosine decay provided a smooth annealing schedule across the full run.
  • Gradient clipping applied at optimizer step time helped avoid gradient explosions after warmup.

Practical trade-offs

  • Shard size: larger shards reduce file count but increase I/O per read; tune SHARD_SIZE_CHUNKS to your disk and training pattern.
  • Batch size vs. accumulation: accumulation increases effective batch size but increases wall-clock time per optimizer step; choose grad_accum_steps to balance memory and throughput.
  • Checkpoint cadence: frequent checkpoints increase disk usage and I/O; rotating saves keep recent history while bounding storage.

Lessons learned and recommendations

Data

  • Pre-tokenize and persist tokenized rows to avoid repeated tokenization and to get accurate token counts for planning.
  • Use deterministic sharding and manifest files for reproducibility.
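A manifest can be as small as a JSON file listing shard paths and chunk counts (a sketch against the Parquet directories created earlier; write_manifest and the output filename are illustrative):

import json
from glob import glob
from pathlib import Path
import pyarrow.parquet as pq

def write_manifest(shard_dir: Path, manifest_path: Path):
    # Record each shard's path and row (chunk) count so a run can be reproduced exactly.
    entries = [
        {'file': f, 'num_chunks': pq.read_metadata(f).num_rows}
        for f in sorted(glob(str(shard_dir / '*.parquet')))
    ]
    manifest = {'shards': entries, 'total_chunks': sum(e['num_chunks'] for e in entries)}
    manifest_path.write_text(json.dumps(manifest, indent=2))

write_manifest(Path('data/combined_chunks_train_parquet'), Path('data/combined_chunks_train_manifest.json'))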

Model

  • Flash Attention (or scaled_dot_product_attention) is a practical way to reduce memory pressure and increase throughput on consumer GPUs. Validate numerical parity with a baseline.

Training

  • Warmup + cosine decay is a robust schedule for long runs.
  • Gradient accumulation is essential for single-GPU large-scale pretraining. Ensure correct loss scaling and clipping semantics.
  • Use rotating checkpoints to limit disk usage while keeping recoverability.

Performance

  • torch.compile can help but measure compile overhead vs. steady-state gains.
  • Enable TF32 and set_float32_matmul_precision on Ampere+ GPUs for faster matmuls when acceptable.

Final thoughts

This part of the series demonstrates how careful engineering across the data pipeline, attention kernel, training loop, and runtime configuration makes large-scale pretraining feasible even on constrained hardware. The code provided is a practical, reproducible blueprint: tokenize once, shard into fixed-length chunks, stream shards with memory-mapped HF datasets, replace attention with a Flash Attention style kernel, compile the model when possible, and run a disciplined training loop with warmup, cosine decay, gradient accumulation, and rotating checkpoints.

Try It Yourself

The full notebook, covering corpus preparation, data loaders, loss computation, the pretraining loop, and text sampling and generation, is available here:

SydsGPT pretraining on Large corpus Repository

Clone the repo, open the Jupyter notebook, and step through the code.

Build It Yourself

If you want to try building it yourself, you can find the complete code with detailed explanations of each block in the source code section at the end of this post. All the best!

What comes next

Part 8 will focus on fine-tuning, specifically instruction fine-tuning and alignment: curating and cleaning an instruction-style dataset (paired prompts and high-quality responses), normalizing formatting and tokenization to match the pretraining pipeline, and splitting into train/validation shards for reproducible experiments. I plan to experiment with lightweight adaptation methods first (LoRA/PEFT or adapters) to iterate quickly on learning rates, weight decay, and few-epoch schedules before committing to full-model fine-tuning.

Later, I will add tool calling for web search and build a RAG pipeline to interact with private data. The aim is a private assistant that respects privacy and delivers practical value, proving that small models can go far when engineered with care.

Source Code

pretraining-largecorpus
