In this part, I scaled a full pretraining pipeline: a ~10B-token corpus, pre-tokenization and chunking for streaming, a Flash Attention replacement inside the GPT blocks, training-loop features (warmup, cosine decay, gradient accumulation), torch.compile for runtime speedups, and GaLoreAdamW as the optimizer. I then ran a long single‑GPU pretraining run (~12B tokens over ~11 days on an NVIDIA 3080 Ti). This post documents the full process, explains the design choices, and shows the exact code so readers can reproduce and adapt each step.

Overview
What this part covers
- Dataset assembly and preprocessing: combining multiple corpora, pre-tokenization with tiktoken, and chunking into fixed-length shards stored as Parquet for streaming.
- Model changes: replacing standard attention with a Flash Attention style implementation using torch.nn.functional.scaled_dot_product_attention, and wiring that into a new transformer block and model class.
- Training loop improvements: LR warmup, cosine decay, gradient accumulation, gradient clipping, periodic evaluation, and rotating checkpoints.
- Performance engineering: torch.compile usage and runtime flags, mixed-precision considerations, and optimizer selection (GaLoreAdamW).
- Run summary and practical lessons from training ~12B tokens on a single 3080 Ti.
Below I walk through each stage and include the notebook code so you can see exactly what was done.
Dataset assembly and tokenization
Goal: build a large, mixed corpus and convert it into tokenized, fixed-length chunks that can be streamed efficiently during training.
Key ideas
- Keep raw text columns minimal to save space.
- Pre-tokenize with tiktoken (GPT-2 encoding) to get deterministic token counts.
- Stream token lists into a buffer and emit fixed-size chunks (here CHUNK_SIZE = 2048) into Parquet shards for efficient, memory-mapped reads.
Code: dataset loading, trimming, concatenation, and saving
fineweb_dataset = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train")
wikipedia_dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
arxiv_dataset = load_dataset("timaeus/pile-arxiv", split="train")
fineweb_dataset = fineweb_dataset.remove_columns([col for col in fineweb_dataset.column_names if col != 'text'])
wikipedia_dataset = wikipedia_dataset.remove_columns([col for col in wikipedia_dataset.column_names if col != 'text'])
arxiv_dataset = arxiv_dataset.remove_columns([col for col in arxiv_dataset.column_names if col != 'text'])
fineweb_dataset.save_to_disk('data/fineweb_dataset')
wikipedia_dataset.save_to_disk('data/wikipedia_dataset')
arxiv_dataset.save_to_disk('data/arxiv_dataset')
fineweb_dataset = load_from_disk('data/fineweb_dataset')
wikipedia_dataset = load_from_disk('data/wikipedia_dataset')
arxiv_dataset = load_from_disk('data/arxiv_dataset')
fineweb_dataset = fineweb_dataset.shuffle(seed=42)
wikipedia_dataset = wikipedia_dataset.shuffle(seed=42)
arxiv_dataset = arxiv_dataset.shuffle(seed=42)
# trim 50% of the dataset
fineweb_dataset = fineweb_dataset.select(range(len(fineweb_dataset)//2))
wikipedia_dataset = wikipedia_dataset.select(range(len(wikipedia_dataset)//2))
arxiv_dataset = arxiv_dataset.select(range(len(arxiv_dataset)//2))
#Concatenate the datasets
combined_dataset = concatenate_datasets([fineweb_dataset, wikipedia_dataset, arxiv_dataset])
combined_dataset.save_to_disk('data/combined_dataset')
Code: tokenization with tiktoken and saving tokenized dataset
import tiktoken
enc = tiktoken.get_encoding("gpt2")
from typing import Optional, Tuple
from datasets import Dataset
# Tokenize a HF Dataset and save to disk. Returns (tokenized_dataset, total_tokens)
def tokenize_and_save(
dataset: Dataset,
out_dir: str,
text_column: str = 'text',
keep_text: bool = False,
batch_size: int = 1000,
num_proc: Optional[int] = None,
) -> Tuple[Dataset, int]:
"""
- Tokenizes each row's text using the global `enc` (tiktoken GPT-2).
- Adds 'input_ids' (List[int]) and 'length' (int) columns.
- Optionally removes the original 'text' column to save space.
- Saves the resulting dataset to `out_dir`.
Returns the tokenized dataset and the total token count.
"""
def tok_batch(batch):
texts = batch[text_column]
input_ids = [enc.encode(t, allowed_special={'<|endoftext|>'}) for t in texts]
lengths = [len(ids) for ids in input_ids]
return {'input_ids': input_ids, 'length': lengths}
remove_cols = None if keep_text else [text_column]
tokenized = dataset.map(
tok_batch,
batched=True,
batch_size=batch_size,
num_proc=num_proc,
remove_columns=remove_cols,
desc=f"Tokenizing -> {out_dir}",
)
# Compute total tokens efficiently by summing the 'length' column
total_tokens = int(sum(tokenized['length']))
# Persist to disk
tokenized.save_to_disk(out_dir)
return tokenized, total_tokens
dataset_path = 'data/combined_dataset'
tokenized_dataset_path = 'data/combined_tokenized_dataset'
# Load datasets from disk
dataset = load_from_disk(dataset_path)
dataset_tokenized, dataset_token_count = tokenize_and_save(dataset, tokenized_dataset_path, keep_text=False, batch_size=1000, num_proc=None)
print('Tokenized dataset sizes (rows):', {
'combined_rows': len(dataset_tokenized),
})
print('Per-dataset token counts:', {
'combined_tokens': dataset_token_count,
})
Why this matters
- Pre-tokenization gives you an exact token count and lets you reason about how many chunks and epochs you can run.
- Saving tokenized rows to disk avoids repeated tokenization during experiments and makes preprocessing reproducible.
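As a concrete illustration of that planning step, here is a small back-of-the-envelope calculation (my own sketch, using the combined token count the notebook output reports further below and the CHUNK_SIZE, BATCH_SIZE, and grad_accum_steps values used in this post):
# Back-of-the-envelope planning (a sketch) from the measured token count.
total_tokens = 8_327_777_943     # combined token count reported by tokenize_and_save below
chunk_size = 2048                # CHUNK_SIZE
batch_size = 2                   # DataLoader batch size
grad_accum_steps = 64            # gradient accumulation factor used later

num_chunks = total_tokens // chunk_size
tokens_per_opt_step = batch_size * grad_accum_steps * chunk_size
opt_steps_per_epoch = num_chunks // (batch_size * grad_accum_steps)

print(f"full chunks available:     {num_chunks:,}")
print(f"tokens per optimizer step: {tokens_per_opt_step:,}")
print(f"optimizer steps per epoch: {opt_steps_per_epoch:,}")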
Chunking into fixed-length shards for streaming
Goal: convert variable-length token lists into fixed-length chunks (2048 tokens) and write them into Parquet shards for efficient streaming and reproducible sampling.
Design choices
- Buffering: accumulate tokens across rows until you can emit a full chunk.
- Shard sizing: choose a shard size (SHARD_SIZE_CHUNKS) that balances file count and I/O throughput.
- Train/val split: random assignment at chunk emission time to get an approximate 80/20 split.
- Parquet: memory-mapped reads via HF Dataset.from_parquet avoid loading everything into RAM.
Code: chunking pipeline and DataLoader wrappers
import os
import random
from pathlib import Path
from glob import glob
import pyarrow as pa
import pyarrow.parquet as pq
import torch
from torch.utils.data import Dataset as TorchDataset, DataLoader
from datasets import load_from_disk, Dataset
# Config
SOURCE_PATH = 'data/combined_tokenized_dataset' # variable-length input_ids
OUT_TRAIN_DIR = Path('data/combined_chunks_train_parquet')
OUT_VAL_DIR = Path('data/combined_chunks_val_parquet')
CHUNK_SIZE = 2048
TRAIN_PROB = 0.8 # approximate 80/20 split at chunk level
SHARD_SIZE_CHUNKS = 25000 # number of chunks per parquet shard (tune for memory/disk throughput)
BATCH_SIZE = 2
SEED = 42
random.seed(SEED)
OUT_TRAIN_DIR.mkdir(parents=True, exist_ok=True)
OUT_VAL_DIR.mkdir(parents=True, exist_ok=True)
# Optional: clean old shards (only .parquet files)
for f in list(OUT_TRAIN_DIR.glob('*.parquet')) + list(OUT_VAL_DIR.glob('*.parquet')):
try:
f.unlink()
except Exception:
pass
# Helper to write a shard of chunks to Parquet
# chunks: List[List[int]] (all must be CHUNK_SIZE long)
def write_parquet_shard(chunks, out_dir: Path, shard_idx: int):
if not chunks:
return
array = pa.array(chunks, type=pa.list_(pa.int32()))
table = pa.table({'input_ids': array})
pq.write_table(table, out_dir / f'part-{shard_idx:05d}.parquet')
# Stream over dataset and produce fixed-size chunks
buf = [] # token buffer
train_batch, val_batch = [], []
train_shard, val_shard = 0, 0
train_count, val_count = 0, 0
src = load_from_disk(SOURCE_PATH)
print('Streaming rows:', len(src))
for row in src:
toks = row['input_ids']
if not toks:
continue
buf.extend(toks)
while len(buf) >= CHUNK_SIZE:
chunk = buf[:CHUNK_SIZE]
del buf[:CHUNK_SIZE]
if random.random() < TRAIN_PROB:
train_batch.append(chunk)
train_count += 1
if len(train_batch) >= SHARD_SIZE_CHUNKS:
write_parquet_shard(train_batch, OUT_TRAIN_DIR, train_shard)
train_shard += 1
train_batch = []
else:
val_batch.append(chunk)
val_count += 1
if len(val_batch) >= SHARD_SIZE_CHUNKS:
write_parquet_shard(val_batch, OUT_VAL_DIR, val_shard)
val_shard += 1
val_batch = []
# Flush leftovers
write_parquet_shard(train_batch, OUT_TRAIN_DIR, train_shard)
write_parquet_shard(val_batch, OUT_VAL_DIR, val_shard)
print({'train_chunks_written': train_count, 'val_chunks_written': val_count, 'leftover_tokens_dropped': len(buf)})
# Build HF Datasets from Parquet shards (memory-mapped; avoids loading everything at once)
train_parquet_files = sorted(glob(str(OUT_TRAIN_DIR / '*.parquet')))
val_parquet_files = sorted(glob(str(OUT_VAL_DIR / '*.parquet')))
train_hfds = Dataset.from_parquet(train_parquet_files)
val_hfds = Dataset.from_parquet(val_parquet_files)
print({'train_rows': len(train_hfds), 'val_rows': len(val_hfds)})
# Torch wrappers and DataLoaders (no attention_mask), with external shift in collate
class FixedLenHFDataset(TorchDataset):
def __init__(self, hf_ds: Dataset):
self.ds = hf_ds
def __len__(self):
return len(self.ds)
def __getitem__(self, idx):
ids = self.ds[idx]['input_ids']
return torch.tensor(ids, dtype=torch.long)
def collate_shift(batch):
x = torch.stack(batch) # (B, CHUNK_SIZE)
y = x.clone()
y[:, :-1] = x[:, 1:]
y[:, -1] = -100
return {'input_ids': x, 'targets': y}
train_fixed = FixedLenHFDataset(train_hfds)
val_fixed = FixedLenHFDataset(val_hfds)
training_dataloader = DataLoader(train_fixed, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_shift)
validation_dataloader = DataLoader(val_fixed, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_shift)
Notes on the collate function
- The collate_shift function prepares input_ids and targets by shifting tokens left for next-token prediction and using -100 as the ignore index for the final token. This keeps the loss computation simple and efficient.
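To make the shift concrete, here is a tiny standalone example (mine, not from the notebook) of what collate_shift produces for two short fake chunks:
import torch

def collate_shift(batch):
    x = torch.stack(batch)        # (B, T)
    y = x.clone()
    y[:, :-1] = x[:, 1:]          # next-token targets
    y[:, -1] = -100               # last position has no next token; ignored by the loss
    return {'input_ids': x, 'targets': y}

batch = [torch.tensor([10, 11, 12, 13, 14]), torch.tensor([20, 21, 22, 23, 24])]
out = collate_shift(batch)
print(out['input_ids'][0].tolist())   # [10, 11, 12, 13, 14]
print(out['targets'][0].tolist())     # [11, 12, 13, 14, -100]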
Model architecture and Flash Attention
Goal: reduce attention memory pressure and improve throughput by using torch.nn.functional.scaled_dot_product_attention (Flash Attention style) while preserving causal masking.
Key points
- The FlashAttention class computes Q/K/V via a single linear layer, reshapes into heads, and calls scaled_dot_product_attention with is_causal=True.
- The transformer block (TransformerBlockv2) uses LayerNorm, the FlashAttention module, and a FeedForward module.
- The top-level model SydsGPTv2 wires token and position embeddings, a stack of TransformerBlockv2, a final layer norm, and an output projection.
Code: FlashAttention, TransformerBlockv2, and SydsGPTv2
import torch
import torch.nn as nn

class FlashAttention(nn.Module):
def __init__(self, embedding_dim, num_heads, dropout=0.1):
super().__init__()
assert embedding_dim % num_heads == 0, "embedding_dim must be divisible by num_heads"
self.embedding_dim = embedding_dim
self.num_heads = num_heads
self.head_dim = embedding_dim // num_heads
self.dropout = dropout
self.qkv = nn.Linear(embedding_dim, 3 * embedding_dim)
self.out_proj = nn.Linear(embedding_dim, embedding_dim)
def forward(self, x):
batch_size, seq_length, _ = x.shape
qkv = self.qkv(x)
qkv = qkv.view(batch_size, seq_length, 3, self.num_heads, self.head_dim)
qkv = qkv.permute(2, 0, 3, 1, 4)
queries, keys, values = qkv
dropout = 0.0 if not self.training else self.dropout
context_vectors = torch.nn.functional.scaled_dot_product_attention(queries, keys, values, attn_mask = None, dropout_p = dropout, is_causal = True)
context_vectors = context_vectors.transpose(1, 2).contiguous().view(batch_size, seq_length, self.embedding_dim)
context_vectors = self.out_proj(context_vectors)
return context_vectors
from modules.LayerNorm import LayerNorm
from modules.FeedForward import FeedForward
import torch.nn as nn
class TransformerBlockv2(nn.Module):
def __init__(self, config):
super().__init__()
self.attention = FlashAttention(
embedding_dim = config["embedding_dim"],
num_heads = config["num_heads"],
dropout = config["dropout"],
)
self.layer_norm1 = LayerNorm(config["embedding_dim"])
self.feed_forward = FeedForward(config)
self.layer_norm2 = LayerNorm(config["embedding_dim"])
self.dropout = nn.Dropout(config["dropout"])
def forward(self, x):
shortcut = x
x = self.layer_norm1(x)
x = self.attention(x)
x = self.dropout(x)
x = x + shortcut
shortcut = x
x = self.layer_norm2(x)
x = self.feed_forward(x)
x = self.dropout(x)
x = x + shortcut
return x
import torch
import torch.nn as nn
from modules.TransformerBlock import TransformerBlock
from modules.LayerNorm import LayerNorm
class SydsGPTv2(nn.Module):
def __init__(self, config):
super().__init__()
self.token_embedding = nn.Embedding(config["vocab_size"], config["embedding_dim"])
self.position_embedding = nn.Embedding(config["context_length"], config["embedding_dim"])
self.drop_embedding = nn.Dropout(config["dropout"])
self.transformer_blocks = nn.Sequential(*[TransformerBlockv2(config) for _ in range(config["num_layers"])])
self.final_layer_norm = LayerNorm(config["embedding_dim"])
self.output_projection = nn.Linear(config["embedding_dim"], config["vocab_size"], bias = False)
def forward(self, input):
batch_size, seq_length = input.shape
token_embeddings = self.token_embedding(input)
position_embeddings = self.position_embedding(torch.arange(seq_length, device=input.device))
x = token_embeddings + position_embeddings
x = self.drop_embedding(x)
x = self.transformer_blocks(x)
x = self.final_layer_norm(x)
logits = self.output_projection(x)
return logits
Practical validation
- Compare logits on a small batch between the FlashAttention model and a baseline to ensure numerical parity within tolerance.
- Confirm is_causal=True to preserve autoregressive behavior.
- Watch dtype: scaled_dot_product_attention supports mixed precision; ensure your autocast and torch.set_float32_matmul_precision settings align with your hardware.
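Here is one way to run the parity check mentioned above in isolation (a minimal sketch for illustration; it compares the fused kernel against a naive causal attention on random tensors rather than the full models):
# Parity sanity check (a sketch): fused SDPA vs. a naive causal attention on random tensors.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, T, D = 2, 4, 16, 32
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))

# Flash-style fused kernel with causal masking
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Naive reference: explicit scores, upper-triangular mask, softmax, weighted sum
scores = (q @ k.transpose(-2, -1)) / (D ** 0.5)
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
naive = scores.masked_fill(mask, float('-inf')).softmax(dim=-1) @ v

torch.testing.assert_close(fused, naive, rtol=1e-4, atol=1e-5)
print("fused and naive causal attention agree within tolerance")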
Compilation and runtime flags
Goal: reduce Python overhead and fuse kernels where possible using torch.compile, and enable safe TF32/precision knobs on Ampere+ GPUs.
Code: performance flags and torch.compile
# Compile model with torch.compile and set performance flags
import torch
import contextlib
# Optional performance knobs (safe on Ampere+ GPUs; harmless on CPU)
try:
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
torch.backends.cudnn.benchmark = True
except Exception:
pass
# Prefer higher-precision matmul kernels if available on your hardware
with contextlib.suppress(Exception):
torch.set_float32_matmul_precision('high') # 'high' or 'medium'
# Choose a compile configuration
compile_backend = 'inductor' # default backend
compile_mode = 'default' # try 'reduce-overhead' or 'max-autotune' later
dynamic_shapes = False # set True if you plan to change batch size frequently
compile_ok = False
try:
model = torch.compile(model, backend=compile_backend, mode=compile_mode, dynamic=dynamic_shapes)
compile_ok = True
print(f"Model compiled with torch.compile (backend={compile_backend}, mode={compile_mode}, dynamic={dynamic_shapes})")
print("Note: First iteration includes compile time; subsequent steps are faster.")
except Exception as e:
print("torch.compile failed; falling back to eager. Error:\n", e)
Notes
- The first iteration after torch.compile includes compilation overhead; measure steady-state throughput after warmup.
- torch.backends.cudnn.benchmark = True helps when input sizes are stable.
- torch.set_float32_matmul_precision('high') can improve matmul performance on supported hardware.
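A rough way to measure steady-state throughput after the compile warmup might look like the following sketch (not part of the notebook; it uses a dummy .sum() in place of the real loss just to drive forward and backward passes):
# Rough steady-state throughput probe (a sketch, not notebook code).
# Assumes `model` is on `device` (a torch.device) and batches come from the dataloader above.
import time
import torch

def measure_tokens_per_sec(model, dataloader, device, warmup_iters=3, timed_iters=10):
    model.train()
    it = iter(dataloader)
    # Warmup: the first iterations absorb torch.compile compilation time
    for _ in range(warmup_iters):
        x = next(it)['input_ids'].to(device)
        model(x).sum().backward()        # dummy loss, just to exercise forward + backward
    model.zero_grad(set_to_none=True)
    if device.type == 'cuda':
        torch.cuda.synchronize()
    start, tokens = time.perf_counter(), 0
    for _ in range(timed_iters):
        x = next(it)['input_ids'].to(device)
        model(x).sum().backward()
        tokens += x.numel()
    if device.type == 'cuda':
        torch.cuda.synchronize()
    model.zero_grad(set_to_none=True)
    return tokens / (time.perf_counter() - start)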
Training loop: warmup, cosine decay, gradient accumulation, and checkpoints
Goals
- Stabilize early training with LR warmup.
- Use cosine decay to anneal LR smoothly across the full training horizon.
- Use gradient accumulation to simulate large effective batch sizes on a single GPU.
- Rotate checkpoints to limit disk usage while keeping recent history.
Hyperparameters used in the run
- initial_lr = 1e-6, peak_lr = 1e-4, min_lr = 0.1 * peak_lr.
- Warmup set to ~2% of steps per epoch.
- grad_accum_steps = 64 to scale the effective batch size.
- checkpoint_interval = 10000 steps (rotating saves).
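The cells below do not show how total_steps_per_epoch and warmup_steps were computed; one plausible derivation consistent with the values listed above (my sketch, not the notebook's exact code) is:
# One plausible derivation of the schedule inputs (a sketch; the exact values used in
# the run are not shown in the notebook excerpt).
initial_lr = 1e-6
peak_lr = 1e-4
min_lr = 0.1 * peak_lr

total_steps_per_epoch = len(training_dataloader)      # one schedule step per batch
warmup_steps = int(0.02 * total_steps_per_epoch)      # ~2% of steps per epoch
print({'total_steps_per_epoch': total_steps_per_epoch, 'warmup_steps': warmup_steps})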
Code: training function v2 (basic warmup + cosine decay)
import math
import os
from modules.Loss import calc_batch_loss
from modules.Generate import generate_sample_text
def train_model_v2(model, training_dataloader, validation_dataloader, optimizer, device,
num_epochs, evaluation_frequency, start_context,
tokenizer, checkpoint_interval, total_steps_per_epoch, warmup_steps, initial_lr, peak_lr, min_lr):
training_losses, validation_losses, total_tokens_processed, learning_rates = [], [], [], []
total_tokens_processed, global_step = 0, -1
total_training_steps = num_epochs * total_steps_per_epoch
lr_increment = (peak_lr - initial_lr) / warmup_steps
for epoch in range(num_epochs):
model.train()
for batch in training_dataloader:
optimizer.zero_grad()
global_step += 1
if global_step < warmup_steps:
lr = initial_lr + global_step * lr_increment
else:
progress = (global_step - warmup_steps) / (total_training_steps - warmup_steps)
lr = min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
for param_group in optimizer.param_groups:
param_group['lr'] = lr
learning_rates.append(lr)
loss = calc_batch_loss(batch['input_ids'], batch['targets'], model, device)
loss.backward()
if global_step >= warmup_steps:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm = 1.0)
optimizer.step()
training_losses.append(loss.item())
total_tokens_processed += (batch['targets'] != -100).sum().item()  # count tokens that contribute to the loss
print(f"Epoch {epoch + 1}, Step {global_step}: Tokens Processed = {total_tokens_processed}, Training Loss = {loss.item()}")
if global_step >= evaluation_frequency and global_step % evaluation_frequency == 0:
model.eval()
val_batch = next(iter(validation_dataloader))
with torch.no_grad():
val_loss = calc_batch_loss(val_batch['input_ids'], val_batch['targets'], model, device)
validation_losses.append(val_loss.item())
print(f"--- Evaluation at Epoch {epoch + 1}, Step {global_step}: Validation Loss = {val_loss.item()} ---")
generate_sample_text(model, tokenizer, device, start_context)
model.train()
if global_step >= checkpoint_interval and global_step % checkpoint_interval == 0:
base_ckpt = "autosave_ckpt1_sydsgpt_v2_164m_trained_model_optimizer.pth"
prev1_ckpt = "autosave_ckpt1_prev1_sydsgpt_v2_164m_trained_model_optimizer.pth"
try:
if os.path.exists(prev1_ckpt):
os.remove(prev1_ckpt)
except Exception:
pass
try:
if os.path.exists(base_ckpt):
os.replace(base_ckpt, prev1_ckpt)
except Exception:
pass
torch.save({"model_state_dict": model.state_dict(), "optimizer_state_dict": optimizer.state_dict()}, base_ckpt)
print(f"Checkpoint saved (rotating): {base_ckpt} | prev1 -> {prev1_ckpt}")
return training_losses, validation_losses, total_tokens_processed, learning_rates
Code: optimizer instantiation
from galore_torch import GaLoreAdamW
optimizer = GaLoreAdamW(model.parameters(), weight_decay=0.05)
Code: training function v3 (warmup + cosine + gradient accumulation + rotated checkpoints)
import math
import os
from modules.Loss import calc_batch_loss
from modules.Generate import generate_sample_text
def train_model_v3(model, training_dataloader, validation_dataloader, optimizer, device,
num_epochs, evaluation_frequency, start_context,
tokenizer, checkpoint_interval, total_steps_per_epoch, warmup_steps, initial_lr, peak_lr, min_lr,
grad_accum_steps: int = 1):
"""
Train with cosine decay + warmup and optional gradient accumulation.
Notes:
- LR/warmup here are updated per batch (DataLoader iteration). If you want warmup
in optimizer steps, compute warmup_steps accordingly (divide by grad_accum_steps).
- loss is scaled by 1/grad_accum_steps before backward to keep gradients invariant.
"""
training_losses, validation_losses, total_tokens_processed, learning_rates = [], [], [], []
total_tokens_processed, global_step = 0, -1
total_training_steps = num_epochs * total_steps_per_epoch
lr_increment = (peak_lr - initial_lr) / max(1, warmup_steps)
accum_counter = 0
optimizer.zero_grad(set_to_none=True)
for epoch in range(num_epochs):
model.train()
for batch in training_dataloader:
global_step += 1
# Learning rate schedule per batch step
if global_step < warmup_steps:
lr = initial_lr + global_step * lr_increment
else:
progress = (global_step - warmup_steps) / max(1, (total_training_steps - warmup_steps))
lr = min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
for pg in optimizer.param_groups:
pg['lr'] = lr
learning_rates.append(lr)
# Forward + backward (accumulated)
loss = calc_batch_loss(batch['input_ids'], batch['targets'], model, device)
training_losses.append(loss.item()) # log unscaled loss
(loss / max(1, grad_accum_steps)).backward()
accum_counter += 1
# Token accounting (per batch)
total_tokens_processed += (batch['targets'] != -100).sum().item()  # count tokens that contribute to the loss
did_optimizer_step = False
if accum_counter % max(1, grad_accum_steps) == 0:
if global_step >= warmup_steps:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad(set_to_none=True)
did_optimizer_step = True
print(f"Epoch {epoch + 1}, Step {global_step} ({'opt-step' if did_optimizer_step else 'accumulating'}): Tokens Processed = {total_tokens_processed}, Training Loss = {loss.item():.4f}, LR = {lr:.2e}")
# Periodic evaluation
if global_step >= evaluation_frequency and global_step % evaluation_frequency == 0:
model.eval()
try:
val_batch = next(iter(validation_dataloader))
with torch.no_grad():
val_loss = calc_batch_loss(val_batch['input_ids'], val_batch['targets'], model, device)
validation_losses.append(val_loss.item())
print(f"--- Evaluation at Epoch {epoch + 1}, Step {global_step}: Validation Loss = {val_loss.item():.4f} ---")
generate_sample_text(model, tokenizer, device, start_context)
except StopIteration:
print("Validation loader empty; skipping eval.")
finally:
model.train()
# Checkpoint rotation (keep last 2)
if global_step >= checkpoint_interval and global_step % checkpoint_interval == 0:
base_ckpt = "autosave_ckpt1_sydsgpt_v2_164m_trained_model_optimizer.pth"
prev1_ckpt = "autosave_ckpt1_prev1_sydsgpt_v2_164m_trained_model_optimizer.pth"
prev2_ckpt = "autosave_ckpt1_prev2_sydsgpt_v2_164m_trained_model_optimizer.pth"
try:
if os.path.exists(prev2_ckpt):
os.remove(prev2_ckpt)
except Exception:
pass
try:
if os.path.exists(prev1_ckpt):
os.replace(prev1_ckpt, prev2_ckpt)
except Exception:
pass
try:
if os.path.exists(base_ckpt):
os.replace(base_ckpt, prev1_ckpt)
except Exception:
pass
torch.save({"model_state_dict": model.state_dict(), "optimizer_state_dict": optimizer.state_dict()}, base_ckpt)
print(f"Checkpoint saved (rotating): {base_ckpt} | prev1 -> {prev1_ckpt} | prev2 -> {prev2_ckpt}")
# Flush leftover grads at epoch end (if any)
if accum_counter % max(1, grad_accum_steps) != 0:
if global_step >= warmup_steps:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad(set_to_none=True)
print("Flushed leftover accumulated gradients at epoch end.")
return training_losses, validation_losses, total_tokens_processed, learning_rates
Practical tips
- Loss scaling: dividing the loss by grad_accum_steps before .backward() keeps gradient magnitudes consistent with larger-batch training (see the sketch after this list).
- Gradient clipping: apply only at optimizer step time to avoid clipping partial gradients repeatedly.
- Checkpoint rotation: keeps disk usage bounded while preserving recent history for recovery.
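The loss-scaling point is easy to verify numerically; the following self-contained sketch (mine, not from the notebook) shows that accumulating gradients of loss / N over N micro-batches reproduces the gradient of the mean loss over the combined batch:
# Numerical check (a sketch): accumulating grads of loss/N over N micro-batches
# matches the gradient of the mean loss over the combined batch.
import torch

torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)
micro_batches = [torch.randn(8, 4) for _ in range(4)]

# Accumulated: backward on (loss / N) for each micro-batch
for x in micro_batches:
    loss = (x @ w).pow(2).mean()
    (loss / len(micro_batches)).backward()
accumulated = w.grad.clone()

# Reference: one backward on the mean of the per-micro-batch losses
w.grad = None
torch.stack([(x @ w).pow(2).mean() for x in micro_batches]).mean().backward()

torch.testing.assert_close(accumulated, w.grad)
print("accumulated gradients match the large-batch gradient")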
Running the experiment and saving final weights
- Instantiated SydsGPTv2 with a 164M-parameter configuration and compiled it where possible.
- Used GaLoreAdamW with weight_decay=0.05.
- Ran train_model_v3 with grad_accum_steps = 64 and saved the final model as "sydsgpt_v2_164m_trained_model-11.8B.pth".
Code: training invocation and final save
from galore_torch import GaLoreAdamW
optimizer = GaLoreAdamW(model.parameters(), weight_decay=0.05)
num_epochs = 2
training_losses, validation_losses, total_tokens_processed, learning_rates = train_model_v2(
model,
training_dataloader,
validation_dataloader,
optimizer,
device,
num_epochs,
evaluation_frequency = 10000,
start_context = "Once upon a time",
tokenizer = enc,
checkpoint_interval = 10000,
total_steps_per_epoch = total_steps_per_epoch,
warmup_steps = warmup_steps,
initial_lr = initial_lr,
peak_lr = peak_lr,
min_lr = min_lr
)
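# Resume: rebuild the model, load the latest rotating checkpoint, and continue training with train_model_v3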
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
model = SydsGPTv2(SYDSGPT_CONFIG_V2_164M)
checkpoint = torch.load("autosave_ckpt1_sydsgpt_v2_164m_trained_model_optimizer.pth", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = GaLoreAdamW(model.parameters(), weight_decay=0.05)
# Restore optimizer state from the checkpoint so the Adam moments resume, then move state tensors to the device
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.to(device)
model.to(device)
num_epochs = 1
grad_accum_steps = 64 # effective batch = BATCH_SIZE * grad_accum_steps
training_losses, validation_losses, total_tokens_processed, learning_rates = train_model_v3(
model,
training_dataloader,
validation_dataloader,
optimizer,
device,
num_epochs,
evaluation_frequency = 10000,
start_context = "Once upon a time",
tokenizer = enc,
checkpoint_interval = 10000,
total_steps_per_epoch = total_steps_per_epoch,
warmup_steps = warmup_steps,
initial_lr = initial_lr,
peak_lr = peak_lr,
min_lr = min_lr,
grad_accum_steps = grad_accum_steps
)
torch.save(model.state_dict(), "sydsgpt_v2_164m_trained_model-11.8B.pth")
Notes
- The notebook shows both train_model_v2 and train_model_v3 being used; train_model_v3 is the final training function with gradient accumulation.
- grad_accum_steps = 64 with BATCH_SIZE = 2 yields an effective batch size of 128, which is a practical way to approximate larger-batch training on a single GPU.
Loading and generation
Code: loading the final checkpoint and generating text
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
model = SydsGPTv2(SYDSGPT_CONFIG_V2_164M)
model.load_state_dict(torch.load("sydsgpt_v2_164m_trained_model-11.8B.pth", map_location=device))
model.to(device)
from modules.Generate import generate, text_to_tokens, tokens_to_text
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
input_text = "A deep neural network is a type of artificial neural network with multiple layers between the input and output layers, which allows it to learn hierarchical patterns in data."
input_tokens = text_to_tokens(input_text, tokenizer).to(device)
output_tokens = generate(model, input_tokens, 1000, SYDSGPT_CONFIG_V2_164M['context_length'], temperature = 1.5, top_k = 40)
output_text = tokens_to_text(output_tokens, tokenizer)
print(f"Output Text:\n {output_text}")
What to watch during generation
- Temperature and top-k: a higher temperature and a larger top-k produce more diverse outputs but can also increase incoherence (see the sketch after this list).
- Context length: ensure the input fits within context_length or is truncated appropriately.
- Token-to-text mapping: use the same tiktoken encoder used during training to avoid tokenization mismatches.
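For intuition, this is roughly how temperature and top-k interact when picking the next token (a sketch of the general technique, not the actual modules.Generate implementation):
# Illustration of temperature + top-k sampling (a sketch, not the modules.Generate code).
import torch

def sample_next_token(logits, temperature=1.5, top_k=40):
    logits = logits / temperature                       # >1 flattens, <1 sharpens the distribution
    kth = torch.topk(logits, top_k).values[..., -1, None]
    logits = logits.masked_fill(logits < kth, float('-inf'))   # keep only the top-k candidates
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

fake_logits = torch.randn(1, 50257)                     # pretend logits over the GPT-2 vocab
print(sample_next_token(fake_logits).item())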
Observations from the run
Throughput and runtime
- Running ~12B tokens on a single 3080 Ti required careful memory management: Flash Attention, gradient accumulation, and mixed-precision-friendly flags were essential.
- torch.compile can reduce Python overhead and improve steady-state throughput, but the first iteration includes compilation time. Measure both compile time and steady-state tokens/sec.
- Parquet shards and memory-mapped HF datasets kept RAM usage low and allowed streaming large corpora without loading everything into memory.
Stability
- LR warmup prevented early divergence. A small initial_lr and a short warmup window (~2% of steps per epoch) stabilized the first phase.
- Cosine decay provided a smooth annealing schedule across the full run.
- Gradient clipping applied at optimizer step time helped avoid gradient explosions after warmup.
Practical trade-offs
- Shard size: larger shards reduce file count but increase I/O per read; tune SHARD_SIZE_CHUNKS to your disk and training pattern.
- Batch size vs. accumulation: accumulation increases the effective batch size but increases wall-clock time per optimizer step; choose grad_accum_steps to balance memory and throughput.
- Checkpoint cadence: frequent checkpoints increase disk usage and I/O; rotating saves keep recent history while bounding storage.
Lessons learned and recommendations
Data
- Pre-tokenize and persist tokenized rows to avoid repeated tokenization and to get accurate token counts for planning.
- Use deterministic sharding and manifest files for reproducibility.
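The notebook does not write a manifest, but one simple approach (a hedged sketch; the file names and fields here are my own choices) is to record the shard list, per-shard row counts, seed, and chunk size alongside the Parquet directories:
# Possible shard manifest (a sketch; the notebook does not include this step).
import json
from glob import glob
from pathlib import Path
import pyarrow.parquet as pq

def write_manifest(shard_dir, out_path, seed=42, chunk_size=2048):
    files = sorted(glob(str(Path(shard_dir) / '*.parquet')))
    manifest = {
        'seed': seed,
        'chunk_size': chunk_size,
        'shards': [{'file': f, 'num_chunks': pq.read_metadata(f).num_rows} for f in files],
    }
    Path(out_path).write_text(json.dumps(manifest, indent=2))
    return manifest

# Example: write_manifest('data/combined_chunks_train_parquet', 'data/train_manifest.json')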
Model
- Flash Attention (or scaled_dot_product_attention) is a practical way to reduce memory pressure and increase throughput on consumer GPUs. Validate numerical parity with a baseline.
Training
- Warmup + cosine decay is a robust schedule for long runs.
- Gradient accumulation is essential for single-GPU large-scale pretraining. Ensure correct loss scaling and clipping semantics.
- Use rotating checkpoints to limit disk usage while keeping recoverability.
Performance
- torch.compile can help, but measure compile overhead vs. steady-state gains.
- Enable TF32 and set_float32_matmul_precision on Ampere+ GPUs for faster matmuls when acceptable.
Final thoughts
This part of the series demonstrates how careful engineering across the data pipeline, attention kernel, training loop, and runtime configuration makes large-scale pretraining feasible even on constrained hardware. The code provided is a practical, reproducible blueprint: tokenize once, shard into fixed-length chunks, stream shards with memory-mapped HF datasets, replace attention with a Flash Attention style kernel, compile the model when possible, and run a disciplined training loop with warmup, cosine decay, gradient accumulation, and rotating checkpoints.
Try It Yourself
The full notebook with all the steps, from preparing the corpus and data loaders to the loss computation, pretraining loop, and text sampling and generation, is available here:
SydsGPT pretraining on Large corpus Repository
Clone the repo, open the Jupyter notebook, and step through the code.
Build It Yourself
If you want to try building it yourself, you can find the complete code with detailed explanations of each block in the source code section at the end of this post. All the best!
What comes next
Part 8 will focus on fine-tuning, specifically instruction fine‑tuning and alignment: curate and clean an instruction‑style dataset (paired prompts and high‑quality responses), normalize formatting and tokenization to match the pretraining pipeline, and split into train/validation shards for reproducible experiments. I will experiment with lightweight adaptation methods first (LoRA/PEFT or adapters) to get fast iteration on learning rates, weight decay, and few‑epoch schedules before committing to full‑model fine‑tuning.
Later, I will add tool calling for web search and build a RAG pipeline to interact with private data. The aim is a private assistant that respects privacy and delivers practical value, proving that small models can go far when engineered with care.
Source Code
Imports Overview
This cell sets up the core utilities needed for the data ingestion and preparation pipeline.
- from datasets import Dataset: Provides the Dataset class (used for typing, inspection, and potential construction of new datasets later in the workflow).
- load_dataset: Downloads and constructs Hugging Face datasets from remote hubs (used later for FineWeb, Wikipedia, and ArXiv corpora).
- concatenate_datasets: Merges multiple homogeneous Dataset objects into one unified dataset (used after individual cleaning steps to form a combined corpus).
- load_from_disk: Reloads previously persisted datasets (enables multi-stage processing without recomputation).
- import random: Supplies PRNG utilities (later used for stochastic train/validation assignment when chunking token sequences and shuffling operations).
Why These Imports Are Here
They form the foundation for a multi-phase pipeline:
- Load raw text corpora.
- Normalize schema to a single text column.
- Persist intermediate results for reproducibility and restartability.
- Re-load, shuffle, subset, and concatenate.
- Stream tokens into fixed-size chunks with probabilistic splitting (requires random).
No execution or side effects occur in this cell; it strictly prepares functionality used in subsequent cells.
from datasets import Dataset, load_dataset, concatenate_datasets, load_from_disk
import random
Dataset Ingestion: FineWeb, Wikipedia (EN 2023-11-01), ArXiv (Pile subset)
This stage downloads three large-scale text corpora via Hugging Face load_dataset using the train split for each:
- FineWeb (HuggingFaceFW/fineweb, config: sample-10BT)
  - High-quality web crawl subset.
  - Contains multiple metadata columns; only text will be retained later.
- Wikipedia (wikimedia/wikipedia, config: 20231101.en)
  - Clean encyclopedic prose.
  - Rich structured fields (e.g., id, url, title); we normalize to raw text.
- ArXiv (timaeus/pile-arxiv)
  - Scientific/technical writing.
  - Complements general + encyclopedic domains with formal research style.
Why load them separately first?
- Allows per-corpus cleaning (column pruning) before concatenation.
- Avoids early memory pressure from merging heterogeneous schemas.
- Facilitates caching: each dataset is stored once under ~/.cache/huggingface/datasets.
Performance / Memory Notes
- Initial load may be disk + network bound; subsequent runs reuse cache.
- If RAM constrained, consider:
  - Using streaming=True and later materializing only needed samples (see the sketch below).
  - Subsetting via .select(...) before tokenization (already done later at the 50% trim).
- Order does not matter; each returns a standalone Dataset object.
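For reference, a streamed load of one of these corpora looks like this (a sketch; the notebook itself materializes the full splits as shown below):
# Streamed load (a sketch; the notebook materializes full splits instead).
from datasets import load_dataset

fineweb_stream = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                              split="train", streaming=True)
for i, row in enumerate(fineweb_stream):
    print(row['text'][:80])
    if i == 2:        # peek at a few rows without downloading the whole split
        break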
Rationale for Multi-Corpus Mix
- Web (diverse style) + encyclopedic (factual structure) + scientific (formal reasoning) improves stylistic robustness.
- Balancing domains early avoids overfitting to a single register.
No side effects beyond network download and cache population occur here; mutation (column removal, shuffling, saving) is deferred to subsequent cells.
fineweb_dataset = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train")
wikipedia_dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
arxiv_dataset = load_dataset("timaeus/pile-arxiv", split="train")
Column Pruning / Schema Normalization (Next Code Cell)
The upcoming code cell (below this markdown) reduces each corpus to a single canonical text field:
fineweb_dataset = fineweb_dataset.remove_columns([col for col in fineweb_dataset.column_names if col != 'text'])
wikipedia_dataset = wikipedia_dataset.remove_columns([col for col in wikipedia_dataset.column_names if col != 'text'])
arxiv_dataset = arxiv_dataset.remove_columns([col for col in arxiv_dataset.column_names if col != 'text'])
Persist Normalized Corpora to Disk
This step serializes each individually cleaned Hugging Face Dataset (FineWeb, Wikipedia, ArXiv) to local storage under the data/ directory. After the prior schema normalization (only the text column retained in the previous code cell), saving achieves:
Why Save Individually?
- Enables fast restart: subsequent runs skip remote download + column pruning by directly calling load_from_disk(...).
- Modular pipeline stages: tokenization, shuffling, trimming, and concatenation occur later without recomputing earlier ingestion work.
- Caching granularity: you can delete or recompute one corpus without touching the others.
- Debugging / inspection: load a single corpus to probe stats or quality before mixing.
File Layout
Each save_to_disk call below creates its own directory under data/ (for example, data/fineweb_dataset).
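If you want to see exactly what a save produced, a quick inspection after the save_to_disk calls below complete might look like this (a sketch; the exact file names and shard counts depend on the datasets library version and dataset size):
# Inspect one saved dataset directory (a sketch; run after the save_to_disk calls below).
from pathlib import Path

for p in sorted(Path('data/fineweb_dataset').iterdir()):
    print(p.name)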
fineweb_dataset.save_to_disk('data/fineweb_dataset')
wikipedia_dataset.save_to_disk('data/wikipedia_dataset')
arxiv_dataset.save_to_disk('data/arxiv_dataset')
Saving the dataset (93/93 shards): 100%|██████████| 14868862/14868862 [02:27<00:00, 101009.83 examples/s] Saving the dataset (40/40 shards): 100%|██████████| 6407814/6407814 [01:23<00:00, 76959.09 examples/s] Saving the dataset (10/10 shards): 100%|██████████| 100000/100000 [00:19<00:00, 5021.42 examples/s]
Reloading Normalized Datasets from Disk
This cell restores each previously saved, schema-normalized Hugging Face Dataset (FineWeb, Wikipedia, ArXiv) from local storage. The datasets were saved after column pruning, so each contains only the canonical text field.
Purpose
- Fast Restart: Avoids repeating remote downloads and column normalization. Enables resuming the pipeline from disk.
- Modular Processing: Each corpus can be independently inspected, shuffled, subsetted, or concatenated in later steps.
- Reproducibility: Ensures that subsequent operations (shuffling, chunking, tokenization) use exactly the same data as prior runs.
File Paths
- data/fineweb_dataset: FineWeb corpus, normalized to the text column.
- data/wikipedia_dataset: Wikipedia corpus, normalized to the text column.
- data/arxiv_dataset: ArXiv corpus, normalized to the text column.
Next Steps
After loading, the datasets will be shuffled and trimmed (see the following code cell), then concatenated for further processing.
No mutation or side effects occur in this cell; it strictly loads datasets into memory for downstream use.
fineweb_dataset = load_from_disk('data/fineweb_dataset')
wikipedia_dataset = load_from_disk('data/wikipedia_dataset')
arxiv_dataset = load_from_disk('data/arxiv_dataset')
Shuffling and Trimming Each Corpus Before Concatenation
This code cell performs two key preprocessing steps on each normalized dataset (FineWeb, Wikipedia, ArXiv):
- Shuffling: Randomizes the order of samples in each corpus using a fixed seed (seed=42) for reproducibility. Shuffling ensures that downstream splits (train/validation) and chunking do not inherit any ordering bias from the original datasets.
- Trimming: Selects only the first 50% of each shuffled dataset. This reduces memory and compute requirements for subsequent steps, making the pipeline more manageable for experimentation or resource-constrained environments.
Why Shuffle and Trim Separately?
- Shuffling before trimming ensures that the subset is a representative sample of the full corpus, not just the first half of the original ordering.
- Trimming after shuffling allows for rapid prototyping and testing without processing the entire dataset.
Output
- Each variable (fineweb_dataset, wikipedia_dataset, arxiv_dataset) now contains a shuffled and trimmed version of the original corpus, ready for concatenation into a single combined dataset.
Next Steps
- The processed datasets will be concatenated in the following cell to form a unified corpus for tokenization and model training.
fineweb_dataset = fineweb_dataset.shuffle(seed=42)
wikipedia_dataset = wikipedia_dataset.shuffle(seed=42)
arxiv_dataset = arxiv_dataset.shuffle(seed=42)
# trim 50% of the dataset
fineweb_dataset = fineweb_dataset.select(range(len(fineweb_dataset)//2))
wikipedia_dataset = wikipedia_dataset.select(range(len(wikipedia_dataset)//2))
arxiv_dataset = arxiv_dataset.select(range(len(arxiv_dataset)//2))
Concatenating Shuffled and Trimmed Datasets
This code cell merges the three preprocessed corpora—FineWeb, Wikipedia, and ArXiv—into a single Hugging Face Dataset using concatenate_datasets. Each input dataset has already been:
- Normalized to contain only the text column.
- Shuffled with a fixed seed for reproducibility.
- Trimmed to the first 50% of samples for efficient experimentation.
Why Concatenate Now?
- Unified Corpus: Combines diverse writing styles (web, encyclopedic, scientific) into one dataset for downstream tokenization and model training.
- Consistent Schema: All datasets share the same column structure (text), enabling seamless merging.
- Balanced Sampling: Shuffling and trimming ensure that each domain is fairly represented in the final mix.
Output
- The resulting combined_dataset contains all rows from the three sources, ready for tokenization and chunking.
Next Steps
- Save the combined dataset to disk for reproducibility and fast reloads.
- Tokenize the unified corpus and prepare it for model training.
No mutation occurs to the original datasets; only a new combined dataset is created.
#Concatenate the datasets
combined_dataset = concatenate_datasets([fineweb_dataset, wikipedia_dataset, arxiv_dataset])
Saving the Combined Dataset to Disk
This code cell persists the unified Hugging Face Dataset—created by concatenating the shuffled and trimmed FineWeb, Wikipedia, and ArXiv corpora—to local storage at data/combined_dataset. Saving the combined dataset achieves several goals:
Why Save the Combined Dataset?
- Fast Reloads: Enables rapid restart of the pipeline from the merged corpus, skipping all prior ingestion, normalization, shuffling, and trimming steps.
- Reproducibility: Guarantees that downstream tokenization and chunking operate on exactly the same data as previous runs.
- Modular Processing: Facilitates experimentation with tokenization, chunking, or model training without repeating earlier preprocessing.
- Disk-Based Workflow: Reduces RAM requirements by allowing later stages to stream or memory-map the dataset from disk.
Output
- The directory data/combined_dataset will contain the serialized dataset, ready for tokenization and chunking in subsequent steps.
No mutation occurs to the original datasets; only the combined dataset is saved.
combined_dataset.save_to_disk('data/combined_dataset')
Saving the dataset (71/71 shards): 100%|██████████| 10688338/10688338 [17:41<00:00, 10070.76 examples/s]
Tokenization and Disk Persistence of the Combined Dataset
This code cell performs two critical steps for preparing the unified corpus for model training:
Tokenization:
- Utilizes the tiktoken library with the GPT-2 encoding (enc).
- Applies the tokenize_and_save function to the loaded combined dataset.
- Each sample's text is converted into a list of integer token IDs (input_ids) and its length (length).
- The original text column is removed (keep_text=False) to save disk space.
Saving to Disk:
- The tokenized dataset is serialized to data/combined_tokenized_dataset for fast reloads and reproducibility.
- This enables downstream chunking and training to operate directly on token sequences, bypassing repeated tokenization.
Outputs
- dataset_tokenized: The Hugging Face Dataset containing tokenized samples (input_ids, length).
- dataset_token_count: The total number of tokens in the combined corpus (for reporting and scaling experiments).
Why This Step Matters
- Efficiency: Tokenization is compute-intensive; saving results avoids redundant work.
- Modularity: Downstream steps (chunking, batching, training) can be restarted from the tokenized dataset.
- Disk-Based Workflow: Reduces RAM requirements and supports scalable data streaming.
Next Steps
- The tokenized dataset will be chunked into fixed-length sequences and split into train/validation sets for model training.
import tiktoken
enc = tiktoken.get_encoding("gpt2")
from typing import Optional, Tuple
from datasets import Dataset
# Tokenize a HF Dataset and save to disk. Returns (tokenized_dataset, total_tokens)
def tokenize_and_save(
dataset: Dataset,
out_dir: str,
text_column: str = 'text',
keep_text: bool = False,
batch_size: int = 1000,
num_proc: Optional[int] = None,
) -> Tuple[Dataset, int]:
"""
- Tokenizes each row's text using the global `enc` (tiktoken GPT-2).
- Adds 'input_ids' (List[int]) and 'length' (int) columns.
- Optionally removes the original 'text' column to save space.
- Saves the resulting dataset to `out_dir`.
Returns the tokenized dataset and the total token count.
"""
def tok_batch(batch):
texts = batch[text_column]
input_ids = [enc.encode(t, allowed_special={'<|endoftext|>'}) for t in texts]
lengths = [len(ids) for ids in input_ids]
return {'input_ids': input_ids, 'length': lengths}
remove_cols = None if keep_text else [text_column]
tokenized = dataset.map(
tok_batch,
batched=True,
batch_size=batch_size,
num_proc=num_proc,
remove_columns=remove_cols,
desc=f"Tokenizing -> {out_dir}",
)
# Compute total tokens efficiently by summing the 'length' column
total_tokens = int(sum(tokenized['length']))
# Persist to disk
tokenized.save_to_disk(out_dir)
return tokenized, total_tokens
dataset_path = 'data/combined_dataset'
tokenized_dataset_path = 'data/combined_tokenized_dataset'
# Load datasets from disk
dataset = load_from_disk(dataset_path)
dataset_tokenized, dataset_token_count = tokenize_and_save(dataset, tokenized_dataset_path, keep_text=False, batch_size=1000, num_proc=None)
print('Tokenized dataset sizes (rows):', {
'combined_rows': len(dataset_tokenized),
})
print('Per-dataset token counts:', {
'combined_tokens': dataset_token_count,
})
Tokenizing -> data/combined_tokenized_dataset: 100%|██████████| 10688338/10688338 [1:08:59<00:00, 2582.15 examples/s] Saving the dataset (67/67 shards): 100%|██████████| 10688338/10688338 [02:11<00:00, 81362.85 examples/s]
Tokenized dataset sizes (rows): {'combined_rows': 10688338}
Per-dataset token counts: {'combined_tokens': 8327777943}
Streaming Chunking and Parquet Sharding of Tokenized Dataset
This code cell implements a memory-efficient streaming chunker for the tokenized dataset, writing fixed-length (2048-token) chunks to Parquet files for both training and validation splits. The process avoids flattening the entire corpus into RAM, instead incrementally buffering tokens and flushing shards to disk.
Key Steps:
Directory Setup & Cleanup
- Creates output directories for train/val shards.
- Removes any existing .parquet files to avoid mixing old and new data.
Chunking Logic
- Streams over the tokenized dataset row-by-row.
- Buffers tokens until at least one full chunk (CHUNK_SIZE = 2048) is available.
- Each chunk is randomly assigned to the train (80%) or validation (20%) split.
Shard Writing
- Chunks are accumulated in batches (SHARD_SIZE_CHUNKS = 25,000).
- Once a batch is full, it is written to a Parquet file using PyArrow.
- This process repeats until all data is processed.
Finalization
- Any remaining chunks are flushed to disk.
- Reports the total number of train/val chunks written and leftover tokens (not enough to form a full chunk).
Why This Matters:
- Scalability: Handles massive datasets without exceeding RAM limits.
- Fast Loading: Parquet shards can be memory-mapped and loaded efficiently for training.
- Balanced Splits: Ensures train/val splits are randomized at the chunk level, not by document.
Output:
- Parquet files in data/combined_chunks_train_parquet and data/combined_chunks_val_parquet, each containing lists of 2048-token chunks.
- Printed summary of chunk counts and dropped tokens.
This cell prepares the data for efficient downstream training with PyTorch DataLoaders.
import os
import random
from pathlib import Path
from glob import glob
import pyarrow as pa
import pyarrow.parquet as pq
import torch
from torch.utils.data import Dataset as TorchDataset, DataLoader
from datasets import load_from_disk, Dataset
# Config
SOURCE_PATH = 'data/combined_tokenized_dataset' # variable-length input_ids
OUT_TRAIN_DIR = Path('data/combined_chunks_train_parquet')
OUT_VAL_DIR = Path('data/combined_chunks_val_parquet')
CHUNK_SIZE = 2048
TRAIN_PROB = 0.8 # approximate 80/20 split at chunk level
SHARD_SIZE_CHUNKS = 25000 # number of chunks per parquet shard (tune for memory/disk throughput)
BATCH_SIZE = 2
SEED = 42
Memory-safe streaming chunker: write 2048-token shards to Parquet, then build loaders (no attention_mask)
If flattening into one giant list runs out of memory, stream rows and write fixed-size chunks incrementally to Parquet shards. Then, load the shards as a Hugging Face Dataset and build PyTorch DataLoaders with externally shifted labels.
random.seed(SEED)
OUT_TRAIN_DIR.mkdir(parents=True, exist_ok=True)
OUT_VAL_DIR.mkdir(parents=True, exist_ok=True)
# Optional: clean old shards (only .parquet files)
for f in list(OUT_TRAIN_DIR.glob('*.parquet')) + list(OUT_VAL_DIR.glob('*.parquet')):
try:
f.unlink()
except Exception:
pass
# Helper to write a shard of chunks to Parquet
# chunks: List[List[int]] (all must be CHUNK_SIZE long)
def write_parquet_shard(chunks, out_dir: Path, shard_idx: int):
if not chunks:
return
array = pa.array(chunks, type=pa.list_(pa.int32()))
table = pa.table({'input_ids': array})
pq.write_table(table, out_dir / f'part-{shard_idx:05d}.parquet')
# Stream over dataset and produce fixed-size chunks
buf = [] # token buffer
train_batch, val_batch = [], []
train_shard, val_shard = 0, 0
train_count, val_count = 0, 0
src = load_from_disk(SOURCE_PATH)
print('Streaming rows:', len(src))
for row in src:
toks = row['input_ids']
if not toks:
continue
buf.extend(toks)
while len(buf) >= CHUNK_SIZE:
chunk = buf[:CHUNK_SIZE]
del buf[:CHUNK_SIZE]
if random.random() < TRAIN_PROB:
train_batch.append(chunk)
train_count += 1
if len(train_batch) >= SHARD_SIZE_CHUNKS:
write_parquet_shard(train_batch, OUT_TRAIN_DIR, train_shard)
train_shard += 1
train_batch = []
else:
val_batch.append(chunk)
val_count += 1
if len(val_batch) >= SHARD_SIZE_CHUNKS:
write_parquet_shard(val_batch, OUT_VAL_DIR, val_shard)
val_shard += 1
val_batch = []
# Flush leftovers
write_parquet_shard(train_batch, OUT_TRAIN_DIR, train_shard)
write_parquet_shard(val_batch, OUT_VAL_DIR, val_shard)
print({'train_chunks_written': train_count, 'val_chunks_written': val_count, 'leftover_tokens_dropped': len(buf)})
Streaming rows: 10688338
{'train_chunks_written': 3254400, 'val_chunks_written': 811897, 'leftover_tokens_dropped': 1687}
Building PyTorch DataLoaders from Parquet-Sharded Hugging Face Datasets
This code cell performs the following steps to prepare efficient PyTorch DataLoaders for training and validation:
Load Parquet Shards as Hugging Face Datasets
- Uses glob to collect all .parquet files from the train and validation chunk directories.
- Loads these files with Dataset.from_parquet, enabling memory-mapped access to large datasets without loading everything into RAM.
Print Dataset Sizes
- Reports the number of rows (chunks) in both train and validation sets for sanity checking.
PyTorch Dataset Wrapper
- Defines FixedLenHFDataset, a wrapper that converts each Hugging Face dataset row (a list of token IDs) into a PyTorch tensor.
- Ensures each sample is of fixed length (CHUNK_SIZE), suitable for transformer training.
Custom Collate Function for Language Modeling
- Implements collate_shift, which stacks batches and shifts targets by one position (next-token prediction).
- The last target token is set to -100 to mask it from loss computation.
Instantiate Datasets and DataLoaders
- Wraps the train and validation Hugging Face datasets with FixedLenHFDataset.
- Creates PyTorch DataLoaders with appropriate batch size and shuffling for training, and disables shuffling for validation.
- Applies the custom collate function to produce input_ids and targets tensors for each batch.
Why This Matters
- Scalability: Handles billions of tokens efficiently by streaming from disk.
- Correct Labeling: Ensures next-token prediction targets are properly aligned for autoregressive training.
- Modularity: Separates data loading, batching, and collation for easy experimentation and debugging.
Output
- training_dataloader and validation_dataloader objects, ready for use in the training loop.
- Printed summary of dataset sizes for verification.
# Build HF Datasets from Parquet shards (memory-mapped; avoids loading everything at once)
train_parquet_files = sorted(glob(str(OUT_TRAIN_DIR / '*.parquet')))
val_parquet_files = sorted(glob(str(OUT_VAL_DIR / '*.parquet')))
train_hfds = Dataset.from_parquet(train_parquet_files)
val_hfds = Dataset.from_parquet(val_parquet_files)
print({'train_rows': len(train_hfds), 'val_rows': len(val_hfds)})
# Torch wrappers and DataLoaders (no attention_mask), with external shift in collate
class FixedLenHFDataset(TorchDataset):
def __init__(self, hf_ds: Dataset):
self.ds = hf_ds
def __len__(self):
return len(self.ds)
def __getitem__(self, idx):
ids = self.ds[idx]['input_ids']
return torch.tensor(ids, dtype=torch.long)
def collate_shift(batch):
x = torch.stack(batch) # (B, CHUNK_SIZE)
y = x.clone()
y[:, :-1] = x[:, 1:]
y[:, -1] = -100
return {'input_ids': x, 'targets': y}
train_fixed = FixedLenHFDataset(train_hfds)
val_fixed = FixedLenHFDataset(val_hfds)
training_dataloader = DataLoader(train_fixed, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_shift)
validation_dataloader = DataLoader(val_fixed, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_shift)
{'train_rows': 3254400, 'val_rows': 811897}
Inspecting a Single Training Batch
This code cell retrieves the next batch from the training_dataloader and prints key diagnostics:
- Batch Shapes: Displays the tensor shapes for both input_ids and targets in the batch. This confirms the batch size and sequence length (should match BATCH_SIZE and CHUNK_SIZE).
- First 10 Tokens: Shows the first 10 token IDs from the first sample in both input_ids and targets. This helps verify that the input and target tensors are correctly aligned for next-token prediction (the target is the input shifted left by one, with the last token masked as -100).
Why This Matters
- Sanity Check: Ensures that the DataLoader, collation function, and chunking pipeline are producing batches in the expected format for language modeling.
- Debugging: Quick inspection of token values and shapes can catch errors in preprocessing, batching, or collation before training begins.
No mutation or side effects occur; this cell is purely for inspection and debugging.
batch = next(iter(training_dataloader))
print('Batch shapes:', {k: v.shape for k, v in batch.items()})
print('input_ids[0][:10]:', batch['input_ids'][0][:10])
print('targets[0][:10]: ', batch['targets'][0][:10])
Model Imports, Instantiation, and Device Setup
This code cell performs the following steps to prepare the SydsGPT model for evaluation or training:
Imports
- Imports PyTorch (torch) and its neural network module (torch.nn).
Model Definition and Configuration
- Imports the SydsGPT model class from the local model.SydsGPT module.
- Defines the configuration dictionary SYDSGPT_CONFIG_164M for a 164M parameter GPT-style model, specifying:
  - vocab_size: Size of the tokenizer vocabulary (GPT-2 default: 50257).
  - context_length: Maximum sequence length (2048 tokens).
  - embedding_dim: Embedding dimension (768).
  - num_heads: Number of attention heads (12).
  - num_layers: Number of transformer blocks (12).
  - dropout: Dropout rate (0.1).
  - qkv_bias: Whether to use bias in QKV projections (False).
Model Instantiation and Device Placement
- Sets a manual random seed (torch.manual_seed(246)) for reproducibility.
- Instantiates the SydsGPT model with the specified configuration.
- Detects the available device (GPU if available, otherwise CPU) and moves the model to that device.
- Sets the model to evaluation mode (model.eval()), disabling dropout and other training-specific behaviors.
Why This Matters
- Reproducibility: Setting the random seed ensures consistent initialization.
- Device Awareness: Automatically uses GPU acceleration if available for faster inference/training.
- Model Readiness: The instantiated model is ready for forward passes, parameter inspection, or further fine-tuning.
No training or inference occurs in this cell; it strictly prepares the model and device context for downstream use.
import torch
import torch.nn as nn
from model.SydsGPT import SydsGPT
SYDSGPT_CONFIG_164M = {
"vocab_size" : 50257,
"context_length" : 2048,
"embedding_dim" : 768,
"num_heads" : 12,
"num_layers" : 12,
"dropout" : 0.1,
"qkv_bias" : False
}
torch.manual_seed(246)
model = SydsGPT(SYDSGPT_CONFIG_164M)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()
SydsGPT(
(token_embedding): Embedding(50257, 768)
(position_embedding): Embedding(2048, 768)
(drop_embedding): Dropout(p=0.1, inplace=False)
(transformer_blocks): Sequential(
(0): TransformerBlock(
(attention): MultiHeadAttention(
(weight_query): Linear(in_features=768, out_features=768, bias=False)
(weight_key): Linear(in_features=768, out_features=768, bias=False)
(weight_value): Linear(in_features=768, out_features=768, bias=False)
(dropout): Dropout(p=0.1, inplace=False)
(output_projection): Linear(in_features=768, out_features=768, bias=True)
)
(layer_norm1): LayerNorm()
(feed_forward): FeedForward(
(layers): Sequential(
(0): Linear(in_features=768, out_features=3072, bias=True)
(1): GELU()
(2): Linear(in_features=3072, out_features=768, bias=True)
)
)
(layer_norm2): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
    ... (blocks (1) through (11), identical to block (0), omitted here for brevity) ...
)
(final_layer_norm): LayerNorm()
(output_projection): Linear(in_features=768, out_features=50257, bias=False)
)
Calculating and Displaying Total Parameters in SydsGPT Model¶
This code cell computes the total number of trainable parameters in the instantiated SydsGPT model and prints the result. This is a crucial diagnostic step for understanding the model’s scale and verifying that the architecture matches expectations.
What Happens in This Cell¶
- Parameter Counting: Uses a generator expression to iterate over all parameters in the model object and sums their element counts (numel()), which gives the total number of scalar weights and biases in the model.
- Output: Prints the total parameter count in a human-readable format, allowing you to confirm the model’s size (e.g., for reporting, scaling experiments, or comparing with published architectures).
Why This Matters¶
- Model Size Verification: Ensures that the SydsGPT model has been instantiated with the correct configuration and matches the intended parameter count (roughly 164M for the provided config).
- Resource Planning: Knowing the parameter count helps estimate memory requirements and training time.
No mutation or side effects occur; this cell is purely for inspection and reporting.
total_parameters = sum(parameter.numel() for parameter in model.parameters())
print(f"Total Parameters in SydsGPT Model: {total_parameters}")
Total Parameters in SydsGPT Model: 163795968
FlashAttention Module: Efficient Causal Self-Attention Layer¶
This code cell defines the FlashAttention class, an efficient implementation of multi-head causal self-attention using PyTorch’s built-in scaled_dot_product_attention primitive. The module is designed for transformer architectures and supports dropout during training.
Key Components¶
Initialization (__init__)
- embedding_dim: Dimensionality of input embeddings.
- num_heads: Number of attention heads.
- head_dim: Computed as embedding_dim // num_heads.
- qkv: Linear layer projecting the input to concatenated queries, keys, and values.
- out_proj: Linear layer projecting the output of attention back to the embedding dimension.
- dropout: Dropout probability applied to attention weights during training.
Forward Pass (forward)
- Projects input x to queries, keys, and values.
- Reshapes and permutes tensors to [batch, heads, seq, head_dim] format.
- Applies causal self-attention using torch.nn.functional.scaled_dot_product_attention with causal masking (is_causal=True).
- Applies dropout to attention weights only during training.
- Projects the attended output back to the original embedding dimension.
Why Use This Module?¶
- Performance: Leverages PyTorch’s optimized attention kernel for speed and memory efficiency.
- Causality: Ensures autoregressive masking for language modeling tasks.
- Modularity: Can be plugged into transformer blocks for building GPT-style models.
No side effects or external dependencies are introduced; this cell strictly defines the FlashAttention module for use in subsequent model construction.
class FlashAttention(nn.Module):
def __init__(self, embedding_dim, num_heads, dropout=0.1):
super().__init__()
assert embedding_dim % num_heads == 0, "embedding_dim must be divisible by num_heads"
self.embedding_dim = embedding_dim
self.num_heads = num_heads
self.head_dim = embedding_dim // num_heads
self.dropout = dropout
self.qkv = nn.Linear(embedding_dim, 3 * embedding_dim)
self.out_proj = nn.Linear(embedding_dim, embedding_dim)
def forward(self, x):
batch_size, seq_length, _ = x.shape
qkv = self.qkv(x)
qkv = qkv.view(batch_size, seq_length, 3, self.num_heads, self.head_dim)
qkv = qkv.permute(2, 0, 3, 1, 4)
queries, keys, values = qkv
dropout = 0.0 if not self.training else self.dropout
context_vectors = torch.nn.functional.scaled_dot_product_attention(queries, keys, values, attn_mask = None, dropout_p = dropout, is_causal = True)
context_vectors = context_vectors.transpose(1, 2).contiguous().view(batch_size, seq_length, self.embedding_dim)
context_vectors = self.out_proj(context_vectors)
return context_vectors
flash_attention = FlashAttention(embedding_dim = SYDSGPT_CONFIG_164M['embedding_dim'], num_heads = SYDSGPT_CONFIG_164M['num_heads'], dropout=SYDSGPT_CONFIG_164M['dropout']).to(device)
embeddings = torch.randn((8, SYDSGPT_CONFIG_164M['context_length'], SYDSGPT_CONFIG_164M['embedding_dim']), device=device)
print('Embeddings shape:', embeddings.shape)
output = flash_attention(embeddings)
print('Flash Attention output shape:', output.shape)
TransformerBlockv2: Residual Block with FlashAttention, LayerNorm, and FeedForward¶
This cell defines the TransformerBlockv2 class, a modular transformer block for GPT-style architectures. It integrates efficient causal self-attention (via FlashAttention), layer normalization, and a feed-forward network, all wrapped with residual connections and dropout for stability.
Components¶
Attention Layer:
- Uses FlashAttention for fast, memory-efficient multi-head causal self-attention.
- Inputs: normalized embeddings. Outputs: contextually mixed representations.
Layer Normalization:
- layer_norm1 before attention, layer_norm2 before the feed-forward network.
- Improves training stability and convergence.
FeedForward Network:
- Applies a position-wise MLP to each token embedding.
- Adds non-linearity and increases model capacity.
Dropout:
- Applied after attention and feed-forward for regularization.
Residual Connections:
- Each sub-layer (attention, feed-forward) is wrapped with a skip connection to preserve input information and ease gradient flow.
Forward Pass¶
- Normalize input and apply attention, then dropout and residual add.
- Normalize again, apply feed-forward, then dropout and residual add.
- Output is ready for stacking in a transformer.
Usage¶
This block is designed to be stacked multiple times in a transformer model (see SydsGPTv2 in the next cell). It is compatible with the configuration dictionary used throughout the notebook.
No side effects or external dependencies beyond the imported modules.
from modules.LayerNorm import LayerNorm
from modules.FeedForward import FeedForward
import torch.nn as nn
class TransformerBlockv2(nn.Module):
def __init__(self, config):
super().__init__()
self.attention = FlashAttention(
embedding_dim = config["embedding_dim"],
num_heads = config["num_heads"],
dropout = config["dropout"],
)
self.layer_norm1 = LayerNorm(config["embedding_dim"])
self.feed_forward = FeedForward(config)
self.layer_norm2 = LayerNorm(config["embedding_dim"])
self.dropout = nn.Dropout(config["dropout"])
def forward(self, x):
shortcut = x
x = self.layer_norm1(x)
x = self.attention(x)
x = self.dropout(x)
x = x + shortcut
shortcut = x
x = self.layer_norm2(x)
x = self.feed_forward(x)
x = self.dropout(x)
x = x + shortcut
return x
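For parity with the FlashAttention smoke test above, a small shape check for the new block is useful (added as an illustration; it reuses the SYDSGPT_CONFIG_164M dictionary and device defined earlier, and the v2 config below has identical values):
# Illustration: a TransformerBlockv2 forward pass should preserve the input shape.
block = TransformerBlockv2(SYDSGPT_CONFIG_164M).to(device)
dummy_embeddings = torch.randn((2, 16, SYDSGPT_CONFIG_164M['embedding_dim']), device=device)
print('TransformerBlockv2 output shape:', block(dummy_embeddings).shape)  # expected: (2, 16, 768)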
SydsGPTv2 Model Definition: GPT-2 Style Transformer with FlashAttention Blocks¶
This code cell defines the SydsGPTv2 class, a GPT-style transformer model designed for efficient autoregressive language modeling. The architecture incorporates several key components:
Components¶
- Token Embedding: Maps input token IDs to dense vectors of size embedding_dim.
- Position Embedding: Adds positional information to each token using learned embeddings for sequence positions up to context_length.
- Embedding Dropout: Applies dropout to the sum of token and position embeddings for regularization.
- Stacked Transformer Blocks: Uses a sequence of TransformerBlockv2 modules, each containing FlashAttention for fast, memory-efficient causal self-attention, plus LayerNorm and FeedForward sublayers with residual connections and dropout.
- Final LayerNorm: Normalizes the output of the last transformer block to stabilize training.
- Output Projection: Projects the final hidden states to logits over the vocabulary for next-token prediction.
Forward Pass¶
Input Processing:
- Converts input token IDs to embeddings.
- Adds position embeddings.
- Applies dropout.
Transformer Stack:
- Passes the embeddings through a stack of TransformerBlockv2 modules.
Output:
- Applies final layer normalization.
- Projects to vocabulary logits for language modeling.
Usage¶
This model is suitable for training and inference on large-scale text corpora. It leverages efficient attention mechanisms and modular design for scalability and performance.
No side effects or external dependencies beyond the imported modules.
import torch
import torch.nn as nn
from modules.TransformerBlock import TransformerBlock
from modules.LayerNorm import LayerNorm
class SydsGPTv2(nn.Module):
def __init__(self, config):
super().__init__()
self.token_embedding = nn.Embedding(config["vocab_size"], config["embedding_dim"])
self.position_embedding = nn.Embedding(config["context_length"], config["embedding_dim"])
self.drop_embedding = nn.Dropout(config["dropout"])
self.transformer_blocks = nn.Sequential(*[TransformerBlockv2(config) for _ in range(config["num_layers"])])
self.final_layer_norm = LayerNorm(config["embedding_dim"])
self.output_projection = nn.Linear(config["embedding_dim"], config["vocab_size"], bias = False)
def forward(self, input):
batch_size, seq_length = input.shape
token_embeddings = self.token_embedding(input)
position_embeddings = self.position_embedding(torch.arange(seq_length, device=input.device))
x = token_embeddings + position_embeddings
x = self.drop_embedding(x)
x = self.transformer_blocks(x)
x = self.final_layer_norm(x)
logits = self.output_projection(x)
return logits
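Before defining the full configuration in the next cell, a minimal forward-pass check (added as an illustration, borrowing the identical 164M config defined earlier) confirms the expected shapes: token IDs of shape (batch, seq_length) map to logits of shape (batch, seq_length, vocab_size).
# Illustration: check the logits shape on a tiny random batch of token IDs.
tmp_model = SydsGPTv2(SYDSGPT_CONFIG_164M).to(device)
tmp_tokens = torch.randint(0, SYDSGPT_CONFIG_164M['vocab_size'], (2, 16), device=device)
with torch.no_grad():
    print('Logits shape:', tmp_model(tmp_tokens).shape)  # expected: (2, 16, 50257)
del tmp_model  # free memory before instantiating the real model below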
SydsGPTv2 Model Configuration (164M Parameters)¶
This cell defines the configuration dictionary SYDSGPT_CONFIG_V2_164M for the SydsGPTv2 model, specifying architectural hyperparameters for a GPT-2 style transformer with approximately 164 million parameters.
Configuration Fields¶
- vocab_size: The size of the tokenizer vocabulary. For GPT-2, this is 50,257 tokens.
- context_length: The maximum sequence length (number of tokens) the model can process in a single forward pass. Here, set to 2,048 tokens.
- embedding_dim: The dimensionality of token and position embeddings, as well as the hidden states throughout the model. Set to 768, matching GPT-2 small.
- num_heads: The number of attention heads in each multi-head self-attention block. Set to 12.
- num_layers: The number of stacked transformer blocks in the model. Set to 12.
- dropout: Dropout probability applied throughout the model for regularization. Set to 0.1.
- qkv_bias: Whether to use bias terms in the query/key/value projections. Set to False for this configuration. (Note: the FlashAttention module above uses a single fused QKV linear layer with its default bias, so this flag is not consumed by the v2 blocks.)
Usage¶
This configuration dictionary is passed to the SydsGPTv2 model constructor in the next cell, ensuring consistent architecture and hyperparameters for training and evaluation.
No computation or side effects occur in this cell; it strictly defines model hyperparameters.
SYDSGPT_CONFIG_V2_164M = {
"vocab_size" : 50257,
"context_length" : 2048,
"embedding_dim" : 768,
"num_heads" : 12,
"num_layers" : 12,
"dropout" : 0.1,
"qkv_bias" : False
}
SydsGPTv2 Model Instantiation and Device Placement¶
This cell performs the following steps to prepare the SydsGPTv2 model for training or inference:
Random Seed Initialization:
- Sets the PyTorch random seed to 246 for reproducibility, ensuring consistent model weight initialization across runs.
Model Instantiation:
- Constructs a new instance of the SydsGPTv2 model using the configuration dictionary SYDSGPT_CONFIG_V2_164M.
- The configuration specifies key hyperparameters such as vocabulary size, context length, embedding dimension, number of heads/layers, dropout rate, and QKV bias.
Device Placement:
- Moves the model to the appropriate device (cuda if a GPU is available, otherwise cpu) for efficient computation.
Evaluation Mode:
- Sets the model to evaluation mode (model.eval()), disabling dropout and other training-specific behaviors.
- This is useful for inference or validation, but can be switched back to training mode (model.train()) as needed.
Why This Matters¶
- Reproducibility: Ensures that model initialization is consistent for debugging and experimentation.
- Performance: Automatically utilizes available hardware acceleration (GPU) for faster computation.
- Readiness: The model is fully instantiated and placed on the correct device, ready for forward passes, parameter inspection, or further fine-tuning.
No training or inference occurs in this cell; it strictly prepares the model and device context for downstream use.
torch.manual_seed(246)
model = SydsGPTv2(SYDSGPT_CONFIG_V2_164M)
model = model.to(device)
model.eval()
SydsGPTv2(
(token_embedding): Embedding(50257, 768)
(position_embedding): Embedding(2048, 768)
(drop_embedding): Dropout(p=0.1, inplace=False)
(transformer_blocks): Sequential(
(0): TransformerBlockv2(
(attention): FlashAttention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(layer_norm1): LayerNorm()
(feed_forward): FeedForward(
(layers): Sequential(
(0): Linear(in_features=768, out_features=3072, bias=True)
(1): GELU()
(2): Linear(in_features=3072, out_features=768, bias=True)
)
)
(layer_norm2): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
    ... (blocks (1) through (11), identical to block (0), omitted here for brevity) ...
)
(final_layer_norm): LayerNorm()
(output_projection): Linear(in_features=768, out_features=50257, bias=False)
)
Calculating and Displaying Total Parameters in SydsGPT V2 Model¶
This code cell computes the total number of trainable parameters in the instantiated SydsGPTv2 model and prints the result. This is an important diagnostic step for verifying the model’s scale and ensuring the architecture matches expectations.
What Happens in This Cell¶
- Parameter Counting: Iterates over all parameters in the model object and sums their element counts (numel()), which gives the total number of scalar weights and biases in the model.
- Output: Prints the total parameter count in a human-readable format, allowing you to confirm the model’s size (e.g., for reporting, scaling experiments, or comparing with published architectures).
Why This Matters¶
- Model Size Verification: Ensures that the SydsGPTv2 model has been instantiated with the correct configuration and matches the intended parameter count (roughly 164M for the provided config). Note that the v2 total (163,823,616) exceeds the v1 total (163,795,968) by exactly 27,648 parameters: the fused QKV projection carries a bias of 3 × 768 = 2,304 values in each of the 12 layers, whereas the v1 attention used bias-free query/key/value projections.
- Resource Planning: Knowing the parameter count helps estimate memory requirements and training time.
No mutation or side effects occur; this cell is purely for inspection and reporting.
total_parameters = sum(parameter.numel() for parameter in model.parameters())
print(f"Total Parameters in SydsGPT V2 Model: {total_parameters}")
Total Parameters in SydsGPT V2 Model: 163823616
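As a cross-check on the printed total, the count can be reconstructed from the architecture by hand (added as an illustration; it assumes the custom LayerNorm holds a 768-dimensional scale and shift, i.e. 1,536 parameters each, which is consistent with the totals):
# Illustration: reconstruct the 163,823,616 total for SYDSGPT_CONFIG_V2_164M.
V, T, d, n_layers = 50257, 2048, 768, 12
embeddings  = V * d + T * d                             # token + position embeddings
attention   = (d * 3 * d + 3 * d) + (d * d + d)         # fused QKV (with bias) + output projection
feedforward = (d * 4 * d + 4 * d) + (4 * d * d + d)     # Linear(768 -> 3072) + Linear(3072 -> 768)
layer_norms = 2 * (2 * d)                               # two LayerNorms per block (scale + shift)
per_block   = attention + feedforward + layer_norms     # 7,087,872 per TransformerBlockv2
total = embeddings + n_layers * per_block + 2 * d + V * d   # + final LayerNorm + output head
print(total)  # 163823616
The output head is a separate Linear layer rather than being tied to the token embedding, which is why the 50,257 × 768 matrix is counted twice.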
Model Compilation with torch.compile and Performance Optimization Flags¶
This cell compiles the SydsGPTv2 model using PyTorch’s torch.compile for accelerated training and inference. It also sets several performance-related flags to maximize throughput on compatible hardware.
Key Steps¶
Performance Flags (CUDA/CPU):
- Enables TensorFloat-32 (TF32) for matrix multiplications and cuDNN operations (Ampere+ GPUs).
- Activates cuDNN benchmarking for optimal kernel selection.
- Sets float32 matmul precision to 'high', allowing faster TF32-class matmul kernels where supported (the stricter default is 'highest').
Model Compilation:
- Attempts to compile the model using torch.compile with the specified backend (inductor) and mode (default).
- Supports dynamic shapes if needed (set via dynamic_shapes).
- Handles compilation errors gracefully, falling back to eager mode if compilation fails.
Diagnostics:
- Prints status messages indicating whether compilation succeeded and which backend/mode was used.
- Notes that the first iteration may include compilation overhead, but subsequent steps will be faster.
Why This Matters¶
- Speed: Compiling the model can significantly accelerate training and inference, especially on modern GPUs.
- Hardware Utilization: Performance flags ensure that the model leverages the fastest available kernels and precision modes.
- Robustness: The cell is designed to work on both CPU and GPU, and will not crash if certain features are unavailable.
No training or inference occurs in this cell; it strictly prepares the model for efficient execution in subsequent steps.
# Compile model with torch.compile and set performance flags
import torch
import contextlib
# Optional performance knobs (safe on Ampere+ GPUs; harmless on CPU)
try:
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
torch.backends.cudnn.benchmark = True
except Exception:
pass
# Trade a little float32 precision for speed: 'high' permits TF32-class matmul kernels where supported
with contextlib.suppress(Exception):
torch.set_float32_matmul_precision('high') # 'high' or 'medium'
# Choose a compile configuration
compile_backend = 'inductor' # default backend
compile_mode = 'default' # try 'reduce-overhead' or 'max-autotune' later
dynamic_shapes = False # set True if you plan to change batch size frequently
compile_ok = False
try:
model = torch.compile(model, backend=compile_backend, mode=compile_mode, dynamic=dynamic_shapes)
compile_ok = True
print(f"Model compiled with torch.compile (backend={compile_backend}, mode={compile_mode}, dynamic={dynamic_shapes})")
print("Note: First iteration includes compile time; subsequent steps are faster.")
except Exception as e:
print("torch.compile failed; falling back to eager. Error:\n", e)
Model compiled with torch.compile (backend=inductor, mode=default, dynamic=False)
Note: First iteration includes compile time; subsequent steps are faster.
Learning Rate Schedule and Training Step Calculation¶
This cell sets up the learning rate schedule and computes the number of training steps per epoch for the SydsGPTv2 training loop. It defines three key learning rate values:
- Initial Learning Rate (initial_lr): The starting learning rate for the warmup phase.
- Peak Learning Rate (peak_lr): The maximum learning rate reached after warmup.
- Minimum Learning Rate (min_lr): The lowest learning rate used during cosine decay, set to 10% of the peak.
It then calculates:
- Total Training Steps Per Epoch (total_steps_per_epoch): The number of batches in one epoch, based on the size of the training dataset and batch size.
- Warmup Steps (warmup_steps): The number of steps over which the learning rate linearly increases from initial_lr to peak_lr, set to 2% of the steps per epoch.
All computed values are printed for verification. These parameters are used in the training loop to control learning rate scheduling and progress tracking.
initial_lr = 1e-6
peak_lr = 1e-4
min_lr = 0.1 * peak_lr
print('Initial LR:', initial_lr)
print('Peak LR:', peak_lr)
print('Min LR:', min_lr)
total_steps_per_epoch = len(train_hfds) // BATCH_SIZE
print('Total training steps per epoch:', total_steps_per_epoch)
warmup_steps = int(total_steps_per_epoch * .02)
print('Warmup steps:', warmup_steps)
Initial LR: 1e-06
Peak LR: 0.0001
Min LR: 1e-05
Total training steps per epoch: 1627200
Warmup steps: 32544
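To make the schedule easier to reason about in isolation, here is the same warmup-plus-cosine rule factored into a standalone function (added as a sketch using the variables defined above; the training loop below implements the identical formula inline):
import math

def lr_at_step(step, total_steps, warmup_steps, initial_lr, peak_lr, min_lr):
    # Linear warmup from initial_lr toward peak_lr for the first warmup_steps steps.
    if step < warmup_steps:
        return initial_lr + step * (peak_lr - initial_lr) / warmup_steps
    # After warmup: cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Spot checks for a two-epoch run: start of warmup, end of warmup, end of training.
total_steps = 2 * total_steps_per_epoch
print(lr_at_step(0, total_steps, warmup_steps, initial_lr, peak_lr, min_lr))             # ~1e-6
print(lr_at_step(warmup_steps, total_steps, warmup_steps, initial_lr, peak_lr, min_lr))  # 1e-4
print(lr_at_step(total_steps, total_steps, warmup_steps, initial_lr, peak_lr, min_lr))   # 1e-5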
Advanced Training Loop: Cosine Decay, Warmup, Rotating Checkpoints¶
This cell defines train_model_v2, an advanced training loop for SydsGPTv2 with several key features:
Features¶
Cosine Decay + Warmup Learning Rate:
- Linearly increases LR from initial_lr to peak_lr over warmup_steps.
- After warmup, applies cosine decay from peak_lr to min_lr for the remainder of training.
- LR is updated per batch step.
Rotating Checkpoints:
- Saves model and optimizer state every checkpoint_interval steps.
- Keeps the last two checkpoints (base_ckpt and its rotated copy prev1_ckpt) for recovery.
Periodic Evaluation:
- Evaluates on the validation set every evaluation_frequency steps.
- Logs validation loss and generates sample text.
Token Accounting:
- Tracks total tokens processed for reporting and scaling.
Inputs¶
- model, training_dataloader, validation_dataloader, optimizer, device
- Training hyperparameters: num_epochs, evaluation_frequency, start_context, tokenizer, checkpoint_interval
- LR schedule: total_steps_per_epoch, warmup_steps, initial_lr, peak_lr, min_lr
Outputs¶
- Lists of training/validation losses, total tokens processed, learning rates
Usage¶
Call this function to train SydsGPTv2 with efficient LR scheduling, periodic evaluation, and rotating checkpointing. It is suited to long-running pretraining experiments where recovery from interruptions matters.
No side effects outside checkpoint files and console logging.
import math
import os
from modules.Loss import calc_batch_loss
from modules.Generate import generate_sample_text
def train_model_v2(model, training_dataloader, validation_dataloader, optimizer, device,
num_epochs, evaluation_frequency, start_context,
tokenizer, checkpoint_interval, total_steps_per_epoch, warmup_steps, initial_lr, peak_lr, min_lr):
    training_losses, validation_losses, learning_rates = [], [], []
    total_tokens_processed, global_step = 0, -1
total_training_steps = num_epochs * total_steps_per_epoch
lr_increment = (peak_lr - initial_lr) / warmup_steps
for epoch in range(num_epochs):
model.train()
for batch in training_dataloader:
optimizer.zero_grad()
global_step += 1
if global_step < warmup_steps:
lr = initial_lr + global_step * lr_increment
else:
progress = (global_step - warmup_steps) / (total_training_steps - warmup_steps)
lr = min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
for param_group in optimizer.param_groups:
param_group['lr'] = lr
learning_rates.append(lr)
loss = calc_batch_loss(batch['input_ids'], batch['targets'], model, device)
loss.backward()
if global_step >= warmup_steps:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm = 1.0)
optimizer.step()
training_losses.append(loss.item())
            total_tokens_processed += batch['input_ids'].numel()
print(f"Epoch {epoch + 1}, Step {global_step}: Tokens Processed = {total_tokens_processed}, Training Loss = {loss.item()}")
if global_step >= evaluation_frequency and global_step % evaluation_frequency == 0:
model.eval()
val_batch = next(iter(validation_dataloader))
with torch.no_grad():
val_loss = calc_batch_loss(val_batch['input_ids'], val_batch['targets'], model, device)
validation_losses.append(val_loss.item())
print(f"--- Evaluation at Epoch {epoch + 1}, Step {global_step}: Validation Loss = {val_loss.item()} ---")
generate_sample_text(model, tokenizer, device, start_context)
model.train()
if global_step >= checkpoint_interval and global_step % checkpoint_interval == 0:
base_ckpt = "autosave_ckpt1_sydsgpt_v2_164m_trained_model_optimizer.pth"
prev1_ckpt = "autosave_ckpt1_prev1_sydsgpt_v2_164m_trained_model_optimizer.pth"
try:
if os.path.exists(prev1_ckpt):
os.remove(prev1_ckpt)
except Exception:
pass
try:
if os.path.exists(base_ckpt):
os.replace(base_ckpt, prev1_ckpt)
except Exception:
pass
torch.save({"model_state_dict": model.state_dict(), "optimizer_state_dict": optimizer.state_dict()}, base_ckpt)
print(f"Checkpoint saved (rotating): {base_ckpt} | prev1 -> {prev1_ckpt}")
return training_losses, validation_losses, total_tokens_processed, learning_rates
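The loop above steps the optimizer on every batch. If you want a larger effective batch size via gradient accumulation, a common pattern is to scale each micro-batch loss and step only every few batches; the sketch below uses an assumed accum_steps value and is not taken from the original run:
# Sketch of gradient accumulation (accum_steps is an assumed, illustrative value).
accum_steps = 4
optimizer.zero_grad()
for micro_step, batch in enumerate(training_dataloader):
    loss = calc_batch_loss(batch['input_ids'], batch['targets'], model, device)
    (loss / accum_steps).backward()                    # scale so accumulated grads match one large batch
    if (micro_step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()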
GaLoreAdamW Optimizer Instantiation for SydsGPTv2¶
This cell initializes the optimizer for training the SydsGPTv2 model using the GaLoreAdamW optimizer from the galore_torch library. GaLoreAdamW is an efficient AdamW variant that leverages low-rank gradient updates to reduce memory usage and accelerate training, making it suitable for large-scale transformer models.
What Happens in This Cell¶
- Import: Imports GaLoreAdamW from the galore_torch package.
- Optimizer Setup: Instantiates the optimizer with the model’s parameters and a weight decay of 0.05 for regularization.
  - model.parameters(): Supplies all trainable parameters of SydsGPTv2.
  - weight_decay=0.05: Applies decoupled (AdamW-style) weight decay to help prevent overfitting.
Why Use GaLoreAdamW?¶
- Memory Efficiency: Reduces optimizer state memory footprint, enabling training of larger models or bigger batches.
- Performance: Maintains AdamW’s adaptive learning rate and weight decay benefits while optimizing for speed and scale.
- Compatibility: Drop-in replacement for standard AdamW; integrates seamlessly with PyTorch training loops.
Usage¶
The resulting optimizer object is used in subsequent training cells to update model weights during backpropagation.
No training or mutation occurs in this cell; it strictly prepares the optimizer for downstream use.
from galore_torch import GaLoreAdamW
optimizer = GaLoreAdamW(model.parameters(), weight_decay=0.05)
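A caveat worth flagging: as I understand the galore_torch API, the low-rank GaLore projection is only applied to parameter groups that explicitly carry a rank entry, so passing plain model.parameters() as above behaves essentially like standard AdamW and the memory savings do not materialize. The sketch below shows the kind of grouped setup the GaLore project describes; the rank, update_proj_gap, scale, and proj_type values are illustrative assumptions, not settings from this run:
# Sketch (assumed values): apply GaLore low-rank updates to the 2-D transformer weights only.
galore_params  = [p for n, p in model.named_parameters()
                  if p.dim() == 2 and 'transformer_blocks' in n]
regular_params = [p for n, p in model.named_parameters()
                  if not (p.dim() == 2 and 'transformer_blocks' in n)]
param_groups = [
    {'params': regular_params},
    {'params': galore_params, 'rank': 128, 'update_proj_gap': 200, 'scale': 0.25, 'proj_type': 'std'},
]
galore_optimizer = GaLoreAdamW(param_groups, lr=peak_lr, weight_decay=0.05)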
Training SydsGPTv2: Cosine Decay, Warmup, Evaluation, and Rotating Checkpoints¶
This cell launches the training loop for SydsGPTv2 using the advanced train_model_v2 function. It orchestrates the following:
Features¶
Cosine Decay + Warmup Learning Rate:
- Starts with a low initial learning rate (initial_lr), linearly increases to a peak (peak_lr) over warmup_steps, then decays to a minimum (min_lr) using a cosine schedule.
- Learning rate is updated every batch.
Rotating Checkpoints:
- Saves model and optimizer state every checkpoint_interval steps.
- Keeps the last two checkpoints for recovery (autosave_ckpt1_sydsgpt_v2_164m_trained_model_optimizer.pth and its previous version).
Periodic Evaluation:
- Every evaluation_frequency steps, evaluates on the validation set and prints validation loss.
- Generates sample text using the current model state for qualitative inspection.
Token Accounting:
- Tracks the total number of tokens processed for reporting and scaling.
Inputs¶
- model: SydsGPTv2 instance, already moved to the correct device.
- training_dataloader, validation_dataloader: PyTorch DataLoaders for train/val splits.
- optimizer: GaLoreAdamW optimizer for efficient memory usage.
- device: CUDA or CPU device.
- num_epochs: Number of epochs to train.
- evaluation_frequency: Steps between validation/evaluation.
- start_context: Initial prompt for sample text generation.
- tokenizer: GPT-2 tokenizer (enc).
- checkpoint_interval: Steps between checkpoint saves.
- total_steps_per_epoch, warmup_steps, initial_lr, peak_lr, min_lr: Learning rate schedule parameters.
Outputs¶
- training_losses: List of training loss values per step.
- validation_losses: List of validation loss values at evaluation intervals.
- total_tokens_processed: Total tokens seen during training.
- learning_rates: List of learning rates used per step.
Usage¶
This cell is the main entry point for model training. It provides robust scheduling, checkpointing, and evaluation, suitable for long-running experiments and recovery from interruptions. With 3,254,400 training chunks of 2,048 tokens each, one epoch covers roughly 6.7 billion tokens, so num_epochs = 2 targets about 13.3 billion tokens in total.
No side effects outside checkpoint files and console logging.
num_epochs = 2
training_losses, validation_losses, total_tokens_processed, learning_rates = train_model_v2(
model,
training_dataloader,
validation_dataloader,
optimizer,
device,
num_epochs,
evaluation_frequency = 10000,
start_context = "Once upon a time",
tokenizer = enc,
checkpoint_interval = 10000,
total_steps_per_epoch = total_steps_per_epoch,
warmup_steps = warmup_steps,
initial_lr = initial_lr,
peak_lr = peak_lr,
min_lr = min_lr
)
Model Loading, Checkpoint Restoration, and Device Placement for SydsGPTv2¶
This cell performs the following steps to restore a previously trained SydsGPTv2 model and optimizer state for further training or inference:
Imports and Device Setup:
- Imports the GaLoreAdamW optimizer from the galore_torch package.
- Detects the available device (cuda if a GPU is present, otherwise cpu) and prints the device being used.
Model Instantiation and Checkpoint Loading:
- Instantiates a new SydsGPTv2 model using the configuration dictionary SYDSGPT_CONFIG_V2_164M.
- Loads the latest rotating checkpoint file (autosave_ckpt1_sydsgpt_v2_164m_trained_model_optimizer.pth), mapping tensors to the detected device, and loads the model state dictionary into the model instance.
Optimizer Restoration and Device Placement:
- Instantiates the GaLoreAdamW optimizer with the model’s parameters and a weight decay of 0.05, then restores its saved state.
- Moves all optimizer state tensors to the correct device to ensure compatibility with the model.
- Moves the model itself to the detected device.
Why This Matters¶
- Checkpoint Recovery: Enables seamless resumption of training or evaluation from the last saved state, preserving both model weights and optimizer momentum.
- Device Consistency: Ensures that all tensors (model and optimizer) are placed on the same device, avoiding runtime errors and maximizing performance.
- Experiment Continuity: Facilitates iterative experimentation, fine-tuning, or evaluation without retraining from scratch.
No training or inference occurs in this cell; it strictly restores model and optimizer state for downstream use.
from galore_torch import GaLoreAdamW
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
model = SydsGPTv2(SYDSGPT_CONFIG_V2_164M)
checkpoint = torch.load("autosave_ckpt1_sydsgpt_v2_164m_trained_model_optimizer.pth", map_location=device)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer = GaLoreAdamW(model.parameters(), weight_decay=0.05)
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])  # restore the optimizer state saved by the training loop
for state in optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.to(device)
model.to(device)
Using device: cuda
e:\Code\SydsGPT-Pretraining-LargeDS\.venv\Lib\site-packages\galore_torch\adamw.py:48: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning warnings.warn(
SydsGPTv2( ... )  (architecture summary identical to the SydsGPTv2 printout shown earlier; omitted for brevity)
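One practical caveat when mixing torch.compile with checkpointing: a state_dict saved from a compiled model can carry an _orig_mod. prefix on every key, which a freshly constructed, uncompiled model will refuse to load. If you hit unexpected/missing key errors at this step, a common workaround (sketched below against the checkpoint variable from the cell above) is to strip the prefix before calling load_state_dict:
# Workaround sketch: strip the torch.compile wrapper prefix, if present, before loading.
raw_state = checkpoint["model_state_dict"]
clean_state = {k.removeprefix("_orig_mod."): v for k, v in raw_state.items()}
model.load_state_dict(clean_state)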
Saving SydsGPTv2 Model Weights to Disk (5.1 Billion Tokens Trained)¶
This cell saves the current state dictionary of the SydsGPTv2 model to disk as "sydsgpt_v2_164m_trained_model-5.1B.pth". This checkpoint represents the model after training on approximately 5.1 billion tokens.
What Happens in This Cell¶
- Model Serialization: Uses torch.save to serialize the model’s parameters (state_dict) to a file. This allows for later restoration, fine-tuning, or inference without retraining.
- File Naming Convention: The filename includes the model type, parameter count (164M), and the number of tokens processed (5.1B), making it easy to track training progress and checkpoint lineage.
Why Save Model Weights?¶
- Experiment Tracking: Preserves the model state at a specific training milestone for reproducibility and comparison.
- Recovery & Deployment: Enables resuming training, performing evaluation, or deploying the model for inference.
- Version Control: Facilitates managing multiple checkpoints corresponding to different stages of training.
No side effects occur beyond writing the checkpoint file to disk.
torch.save(model.state_dict(), "sydsgpt_v2_164m_trained_model-5.1B.pth")
Model Loading and Inference: SydsGPTv2 with 5.1B Token Checkpoint¶
This cell demonstrates how to restore a previously trained SydsGPTv2 model from disk and perform text generation using the loaded weights. The workflow includes:
Imports and Device Setup
- Imports the GaLoreAdamW optimizer from galore_torch.
- Detects the available device (cuda if a GPU is present, otherwise cpu) and prints the device being used.
Model Instantiation and Checkpoint Loading
- Instantiates a new SydsGPTv2 model using the configuration dictionary SYDSGPT_CONFIG_V2_164M.
- Loads the model weights from the "sydsgpt_v2_164m_trained_model-5.1B.pth" checkpoint, mapping tensors to the detected device.
- Instantiates the GaLoreAdamW optimizer with the model’s parameters and a weight decay of 0.05.
- Moves the model to the detected device.
Usage
- The restored model is ready for further training, evaluation, or inference.
- This cell is typically followed by text generation or validation steps.
Why This Matters¶
- Checkpoint Recovery: Enables seamless resumption of training or inference from a specific milestone.
- Device Consistency: Ensures all tensors are placed on the correct device for efficient computation.
- Experiment Continuity: Facilitates iterative experimentation and deployment without retraining from scratch.
No training or inference occurs in this cell; it strictly restores model and optimizer state for downstream use.
from galore_torch import GaLoreAdamW
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
model = SydsGPTv2(SYDSGPT_CONFIG_V2_164M)
model.load_state_dict(torch.load("sydsgpt_v2_164m_trained_model-5.1B.pth", map_location=device))
optimizer = GaLoreAdamW(model.parameters(), weight_decay=0.05)
model.to(device)
Using device: cuda
e:\Code\SydsGPT-Pretraining-LargeDS\.venv\Lib\site-packages\galore_torch\adamw.py:48: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning warnings.warn(
SydsGPTv2( ... )  (architecture summary identical to the SydsGPTv2 printout shown earlier; omitted for brevity)
Text Generation with SydsGPTv2: Sampling from a Trained Model
This cell demonstrates how to generate text using the SydsGPTv2 model and a GPT-2 tokenizer. The workflow includes:
Imports and Tokenizer Setup
- Imports the generate, text_to_tokens, and tokens_to_text functions from the modules.Generate module.
- Initializes the GPT-2 tokenizer using the tiktoken library.
Input Preparation
- Defines an input prompt: "Once upon a time there was a kingdom far away where".
- Converts the input text to token IDs using the tokenizer and moves them to the appropriate device (CPU or GPU).
Text Generation
- Calls the generate function to sample 200 new tokens from the model, using a context length of 2048, a temperature of 1.5 (for more creative outputs), and top-k sampling with k=40 (restricting sampling to the 40 most probable tokens at each step).
Output Decoding and Display
- Converts the generated token IDs back to human-readable text.
- Prints the generated output for inspection.
Why This Matters
- Qualitative Evaluation: Enables rapid inspection of the model's generative capabilities after training or fine-tuning.
- Sampling Controls: Temperature and top-k parameters allow for tuning creativity and diversity in the generated text.
- End-to-End Demonstration: Shows the complete process from prompt to generated output, suitable for inference, validation, or deployment.
No training or mutation occurs in this cell; it strictly performs inference and displays the result.
from modules.Generate import generate, text_to_tokens, tokens_to_text
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
input_text = "Once upon a time there was a kingdom far away where"
input_tokens = text_to_tokens(input_text, tokenizer).to(device)
output_tokens = generate(model, input_tokens, 200, SYDSGPT_CONFIG_V2_164M['context_length'], temperature = 1.5, top_k = 40)
output_text = tokens_to_text(output_tokens, tokenizer)
print(f"Output Text:\n {output_text}")
Output Text: Once upon a time there was a kingdom far away where the men were not allowed away when a time a God wanted for their own, as long as the sons did away the children in the church. If the Holy Spirit were against them they would go to bed at all costs before they came to bed and have their daily breaded supper for their father or father to drink before they had left. (As many of the more people of these places had died while sleeping in the bed with them that day before their next feast.) Now in 1878, a small minority church was officially created in order to provide good health on sickness, although a few small groups had become the majority of the society at this time. For good example, from 1925 as the outbreak of plague hit, from 1915, there went some to work out at one-way hospitals but did not work out at both first and then to a third, then, finally the death had happened and so took three more. Then, from 1931 to 1944 and then from 1944 through 1958 at this
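To make the sampling controls concrete, here is a minimal, self-contained sketch of a single decoding step with temperature scaling and top-k filtering. It illustrates the technique only; it is not the code inside modules.Generate, and the function name sample_next_token and the random logits are hypothetical.
import torch
def sample_next_token(logits: torch.Tensor, temperature: float = 1.5, top_k: int = 40) -> torch.Tensor:
    # Temperature > 1 flattens the distribution (more diverse); < 1 sharpens it.
    scaled = logits / temperature
    # Top-k filtering: keep the k largest logits, push everything else to -inf.
    top_values, _ = torch.topk(scaled, top_k, dim=-1)
    cutoff = top_values[..., -1, None]  # k-th largest logit per row
    filtered = scaled.masked_fill(scaled < cutoff, float("-inf"))
    probs = torch.softmax(filtered, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # shape: (batch, 1)
# Example with random logits over the GPT-2 vocabulary (50257 tokens).
next_token = sample_next_token(torch.randn(1, 50257))
print(next_token.shape)  # torch.Size([1, 1])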
Advanced Training Loop with Gradient Accumulation: train_model_v3
This cell defines train_model_v3, an enhanced training loop for SydsGPTv2 that supports gradient accumulation in addition to cosine decay learning rate scheduling, warmup, periodic evaluation, and rotating checkpoints.
Key Features
Gradient Accumulation:
- Allows effective batch sizes larger than GPU memory permits by accumulating gradients over multiple mini-batches (grad_accum_steps).
- Scales the loss by 1/grad_accum_steps before backward to keep gradients invariant to the accumulation factor.
- Performs the optimizer step and gradient zeroing only after the specified number of accumulation steps.
Cosine Decay + Warmup Learning Rate:
- Linearly increases the LR from initial_lr to peak_lr over warmup_steps.
- Applies cosine decay from peak_lr to min_lr for the remainder of training.
- The LR is updated per batch step.
Rotating Checkpoints:
- Saves model and optimizer state every checkpoint_interval steps.
- Keeps the last three checkpoints (base_ckpt, prev1_ckpt, prev2_ckpt) for robust recovery.
Periodic Evaluation:
- Evaluates on the validation set every evaluation_frequency steps.
- Logs validation loss and generates sample text for qualitative inspection.
Token Accounting:
- Tracks the total number of tokens processed for reporting and scaling.
Inputs
- model, training_dataloader, validation_dataloader, optimizer, device
- Training hyperparameters: num_epochs, evaluation_frequency, start_context, tokenizer, checkpoint_interval
- LR schedule: total_steps_per_epoch, warmup_steps, initial_lr, peak_lr, min_lr
- grad_accum_steps: number of mini-batches to accumulate before each optimizer step
Outputs
- Lists of training/validation losses, the total number of tokens processed, and the learning rates used per step.
Usage
Call this function to train SydsGPTv2 with efficient scheduling, checkpointing, and large effective batch sizes via gradient accumulation. It is suited to long single-GPU runs where the effective batch size must exceed what fits in memory.
No side effects outside checkpoint files and console logging.
import math
import os
import torch
from modules.Loss import calc_batch_loss
from modules.Generate import generate_sample_text
def train_model_v3(model, training_dataloader, validation_dataloader, optimizer, device,
num_epochs, evaluation_frequency, start_context,
tokenizer, checkpoint_interval, total_steps_per_epoch, warmup_steps, initial_lr, peak_lr, min_lr,
grad_accum_steps: int = 1):
"""
Train with cosine decay + warmup and optional gradient accumulation.
Notes:
- LR/warmup here are updated per batch (DataLoader iteration). If you want warmup
in optimizer steps, compute warmup_steps accordingly (divide by grad_accum_steps).
- loss is scaled by 1/grad_accum_steps before backward to keep gradients invariant.
"""
    training_losses, validation_losses, learning_rates = [], [], []
    total_tokens_processed, global_step = 0, -1
total_training_steps = num_epochs * total_steps_per_epoch
lr_increment = (peak_lr - initial_lr) / max(1, warmup_steps)
accum_counter = 0
optimizer.zero_grad(set_to_none=True)
for epoch in range(num_epochs):
model.train()
for batch in training_dataloader:
global_step += 1
# Learning rate schedule per batch step
if global_step < warmup_steps:
lr = initial_lr + global_step * lr_increment
else:
progress = (global_step - warmup_steps) / max(1, (total_training_steps - warmup_steps))
lr = min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
for pg in optimizer.param_groups:
pg['lr'] = lr
learning_rates.append(lr)
# Forward + backward (accumulated)
loss = calc_batch_loss(batch['input_ids'], batch['targets'], model, device)
training_losses.append(loss.item()) # log unscaled loss
(loss / max(1, grad_accum_steps)).backward()
accum_counter += 1
# Token accounting (per batch)
total_tokens_processed += (batch['input_ids'] != -100).sum().item()
did_optimizer_step = False
if accum_counter % max(1, grad_accum_steps) == 0:
if global_step >= warmup_steps:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad(set_to_none=True)
did_optimizer_step = True
print(f"Epoch {epoch + 1}, Step {global_step} ({'opt-step' if did_optimizer_step else 'accumulating'}): Tokens Processed = {total_tokens_processed}, Training Loss = {loss.item():.4f}, LR = {lr:.2e}")
# Periodic evaluation
if global_step >= evaluation_frequency and global_step % evaluation_frequency == 0:
model.eval()
try:
val_batch = next(iter(validation_dataloader))
with torch.no_grad():
val_loss = calc_batch_loss(val_batch['input_ids'], val_batch['targets'], model, device)
validation_losses.append(val_loss.item())
print(f"--- Evaluation at Epoch {epoch + 1}, Step {global_step}: Validation Loss = {val_loss.item():.4f} ---")
generate_sample_text(model, tokenizer, device, start_context)
except StopIteration:
print("Validation loader empty; skipping eval.")
finally:
model.train()
            # Checkpoint rotation (keep the last three checkpoint files)
if global_step >= checkpoint_interval and global_step % checkpoint_interval == 0:
base_ckpt = "autosave_ckpt1_sydsgpt_v2_164m_trained_model_optimizer.pth"
prev1_ckpt = "autosave_ckpt1_prev1_sydsgpt_v2_164m_trained_model_optimizer.pth"
prev2_ckpt = "autosave_ckpt1_prev2_sydsgpt_v2_164m_trained_model_optimizer.pth"
try:
if os.path.exists(prev2_ckpt):
os.remove(prev2_ckpt)
except Exception:
pass
try:
if os.path.exists(prev1_ckpt):
os.replace(prev1_ckpt, prev2_ckpt)
except Exception:
pass
try:
if os.path.exists(base_ckpt):
os.replace(base_ckpt, prev1_ckpt)
except Exception:
pass
torch.save({"model_state_dict": model.state_dict(), "optimizer_state_dict": optimizer.state_dict()}, base_ckpt)
print(f"Checkpoint saved (rotating): {base_ckpt} | prev1 -> {prev1_ckpt} | prev2 -> {prev2_ckpt}")
# Flush leftover grads at epoch end (if any)
if accum_counter % max(1, grad_accum_steps) != 0:
if global_step >= warmup_steps:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad(set_to_none=True)
print("Flushed leftover accumulated gradients at epoch end.")
return training_losses, validation_losses, total_tokens_processed, learning_rates
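Because each rotating checkpoint stores both the model and optimizer state, an interrupted run can be resumed from the newest surviving file. Below is a minimal sketch of that recovery step, assuming the model, optimizer, and device objects from this notebook; the filenames match those written by train_model_v3, while everything else is illustrative rather than the exact code used in the run.
import os
import torch
# Pick the newest surviving rotating checkpoint written by train_model_v3.
candidates = [
    "autosave_ckpt1_sydsgpt_v2_164m_trained_model_optimizer.pth",
    "autosave_ckpt1_prev1_sydsgpt_v2_164m_trained_model_optimizer.pth",
    "autosave_ckpt1_prev2_sydsgpt_v2_164m_trained_model_optimizer.pth",
]
ckpt_path = next((p for p in candidates if os.path.exists(p)), None)
if ckpt_path is not None:
    checkpoint = torch.load(ckpt_path, map_location=device)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    print(f"Resumed model and optimizer state from {ckpt_path}")
else:
    print("No rotating checkpoint found; starting from the current weights.")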
Training SydsGPTv2 with Gradient Accumulation: One Epoch, Large Effective Batch Size
This cell launches the advanced training loop for SydsGPTv2 using the train_model_v3 function, which supports gradient accumulation for large effective batch sizes. The workflow includes:
Epoch and Gradient Accumulation Setup
- Sets num_epochs = 1 for a single epoch of training.
- Sets grad_accum_steps = 64, meaning gradients are accumulated over 64 mini-batches before each optimizer step. This increases the effective batch size to BATCH_SIZE * grad_accum_steps, enabling training with larger batches than fit in GPU memory.
Training Loop Invocation
- Calls train_model_v3 with all required arguments: model, training_dataloader, validation_dataloader, optimizer, device.
- Learning rate schedule parameters: initial_lr, peak_lr, min_lr, warmup_steps, total_steps_per_epoch.
- Evaluation and checkpoint intervals: every 10,000 steps.
- start_context: initial prompt for sample text generation; tokenizer: GPT-2 tokenizer (enc); grad_accum_steps: number of mini-batches to accumulate before each optimizer step.
Features of train_model_v3
- Cosine Decay + Warmup Learning Rate: Linearly increases the LR during warmup, then applies cosine decay.
- Gradient Accumulation: Scales the loss and accumulates gradients, performing an optimizer step only after grad_accum_steps mini-batches.
- Rotating Checkpoints: Saves model and optimizer state every 10,000 steps, keeping the last three checkpoints for recovery.
- Periodic Evaluation: Evaluates on the validation set and generates sample text every 10,000 steps.
- Token Accounting: Tracks total tokens processed for reporting.
Outputs
- Returns lists of training and validation losses, total tokens processed, and learning rates used per step.
Why Use Gradient Accumulation?
- Memory Efficiency: Enables training with very large effective batch sizes, even on limited GPU memory.
- Stability: Larger batches can improve gradient estimates and training stability.
- Scalability: Lets a single GPU approximate the effective batch sizes of larger multi-GPU setups when scaling experiments.
No side effects occur outside checkpoint files and console logging. This cell is the main entry point for large-batch training with robust scheduling and checkpointing.
num_epochs = 1
grad_accum_steps = 64 # effective batch = BATCH_SIZE * grad_accum_steps
training_losses, validation_losses, total_tokens_processed, learning_rates = train_model_v3(
model,
training_dataloader,
validation_dataloader,
optimizer,
device,
num_epochs,
evaluation_frequency = 10000,
start_context = "Once upon a time",
tokenizer = enc,
checkpoint_interval = 10000,
total_steps_per_epoch = total_steps_per_epoch,
warmup_steps = warmup_steps,
initial_lr = initial_lr,
peak_lr = peak_lr,
min_lr = min_lr,
grad_accum_steps = grad_accum_steps
)
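The schedule arguments passed above (total_steps_per_epoch, warmup_steps, initial_lr, peak_lr, min_lr) were defined earlier in the notebook. As a rough guide for adapting the run, here is one way such values could be derived; the warmup fraction, batch size, and learning rates below are assumptions for the sketch, not the exact values used in this run.
# Illustrative derivation of the schedule parameters consumed by train_model_v3.
# The 2% warmup fraction, BATCH_SIZE, and LR endpoints are assumptions for this sketch.
BATCH_SIZE = 4                                            # per-device mini-batch size (hypothetical)
grad_accum_steps = 64                                     # as set above
effective_batch_size = BATCH_SIZE * grad_accum_steps      # sequences per optimizer step
total_steps_per_epoch = len(training_dataloader)          # one LR step per mini-batch
warmup_steps = int(0.02 * total_steps_per_epoch)          # warm up over ~2% of the epoch
initial_lr, peak_lr, min_lr = 1e-7, 3e-4, 3e-5            # hypothetical LR endpoints
print(f"effective batch: {effective_batch_size} sequences, "
      f"warmup: {warmup_steps} of {total_steps_per_epoch} steps")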
Saving SydsGPTv2 Model Weights to Disk (11.8 Billion Tokens Trained)
This cell saves the current state dictionary of the SydsGPTv2 model to disk as "sydsgpt_v2_164m_trained_model-11.8B.pth". This checkpoint represents the model after training on approximately 11.8 billion tokens.
What Happens in This Cell
- Model Serialization: Uses torch.save to serialize the model's parameters (state_dict) to a file. This allows for later restoration, fine-tuning, or inference without retraining.
- File Naming Convention: The filename includes the model type, parameter count (164M), and the number of tokens processed (11.8B), making it easy to track training progress and checkpoint lineage.
Why Save Model Weights?
- Experiment Tracking: Preserves the model state at a specific training milestone for reproducibility and comparison.
- Recovery & Deployment: Enables resuming training, performing evaluation, or deploying the model for inference.
- Version Control: Facilitates managing multiple checkpoints corresponding to different stages of training.
No side effects occur beyond writing the checkpoint file to disk.
torch.save(model.state_dict(), "sydsgpt_v2_164m_trained_model-11.8B.pth")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
model = SydsGPTv2(SYDSGPT_CONFIG_V2_164M)
model.load_state_dict(torch.load("sydsgpt_v2_164m_trained_model-11.8B.pth", map_location=device))
model.to(device)
Using device: cuda
SydsGPTv2(
(token_embedding): Embedding(50257, 768)
(position_embedding): Embedding(2048, 768)
(drop_embedding): Dropout(p=0.1, inplace=False)
(transformer_blocks): Sequential(
(0): TransformerBlockv2(
(attention): FlashAttention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(layer_norm1): LayerNorm()
(feed_forward): FeedForward(
(layers): Sequential(
(0): Linear(in_features=768, out_features=3072, bias=True)
(1): GELU()
(2): Linear(in_features=3072, out_features=768, bias=True)
)
)
(layer_norm2): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
(1): TransformerBlockv2(
(attention): FlashAttention(
(qkv): Linear(in_features=768, out_features=2304, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(layer_norm1): LayerNorm()
(feed_forward): FeedForward(
(layers): Sequential(
(0): Linear(in_features=768, out_features=3072, bias=True)
(1): GELU()
(2): Linear(in_features=3072, out_features=768, bias=True)
)
)
(layer_norm2): LayerNorm()
(dropout): Dropout(p=0.1, inplace=False)
)
    (2)-(11): ten more TransformerBlockv2 blocks with the same structure as (0) and (1) above
)
(final_layer_norm): LayerNorm()
(output_projection): Linear(in_features=768, out_features=50257, bias=False)
)
from modules.Generate import generate, text_to_tokens, tokens_to_text
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
input_text = "A deep neural network is a type of artificial neural network with multiple layers between the input and output layers, which allows it to learn hierarchical patterns in data."
input_tokens = text_to_tokens(input_text, tokenizer).to(device)
output_tokens = generate(model, input_tokens, 1000, SYDSGPT_CONFIG_V2_164M['context_length'], temperature = 1.5, top_k = 40)
output_text = tokens_to_text(output_tokens, tokenizer)
print(f"Output Text:\n {output_text}")
Output Text: Once upon a time there was a kingdom far away where a man lived! So now if we are a nation that believes the things in their flesh the people will change their destiny in their hands.” And what do children who feel safe go to seek out more good souls have faith and want to learn to grow in our bodies and our bodies.” There are numerous spiritual leaders at the moment that try to see us that may give us hope and desire, and we can begin to believe in others. We believe in our worth. God speaks up in our souls and our minds to tell them to take us first and to make our souls stronger and fuller. When someone begins asking this question on his own or her mind he tells them he knows that he believes with his heart in his faith. If he answers this question for the first time he is open and content, but also open and not content on wanting his heart for his soul as there could be more than hope. There are two common ways to find spiritual growth in our bodies that make these soul renews. 1 John 2.15 (I will give up only to one who thinks that all the people of heaven shall have such thoughts so I will never take no away that I will give up my mind.) 2. John 1.28 is another great source of spiritual learning for a true Christian with a faith to trust in the Lord Himself. God said. “As was taught in your holy books” (Ps. 8:6b) ‘No man shall possess any spirit among the Gentiles’ (II Corinthians 9:31-33). He gave a message to his enemies and was instructed according to an invitation that was given at Pentecost (the day the Lord spoke before his first wife, Ephesians); this he had said. It is also called the ‘Father,” (Ps. 25). In this passage Jesus told the children to read by a book because they feared that they had a good spiritual state; he said that they didn’t have enough spirit to understand that this meant the work of a perfect God. ”For we will be in an environment of confusion, confusion is not of the Bible” (Gen. 1, 11). His message comes as he told his sons: If we keep everything separate and not keep what the sons of men thought of in an empty land it will not be to give God the kingdom of heaven above us” or so he tells his sons to remember that the good will to make them believe. A believer may still remain in this world, it is with faith when our flesh does not give itself and this is a good time in life when our faith is broken . There God would know when anyone will trust that God will provide for him as well. At this point I had only said this in a previous work I had before. So let me take on it: Let the Lord help us keep and renew God” (Ps. 8). Notice this: We go back to a few pages here and read for our daily activities that take us beyond those of the living Word, His teachings are still important but relevant, there are numerous others that come into touch, in this book Jesus gives the same answers to each believer. He says it in Scripture: We must stay and renew every word; and if you do that you might learn that it is the right way to say that He will come into the world. We all must think carefully before us. God’s help us continually as there is always in our life, the will of His people is His name to give all the words and knowledge we have. Ps. 6.5 And He says: All things are to keep your mouth from my mouth and into the eye” (Ps. 
1-10) He told the children to study before the Lord that He should keep you from your mouth; if you keep everything separate the word, your eyes and will know that it is our first way to the word he said that everything is for my mouth. Then Jesus said to his children the same saying to him. These children know that “the one who will give me first and my child and my home.” The children must now be used to live as you tell them: The Father and the Mother will be your enemies at heaven. The Child’s words make me want in prayer to hear your words and to make them believe it will bring your life; let me teach the two to speak from it. These Wordings will not always result in an end as He told the children in this work; but because He gave them in a way as if on every page it will bring them on to the Lord. If the Spirit, by His Holy Presence and His Word does not see you there will be great need for you will remember that it always gives our first and for this we may take part in it. At these things as is always said
