Supervised Instruction Fine-Tuning on Alpaca, Deployment, and Why 164M Isn’t Enough

In Part 7, I pretrained a 164M parameter GPT-style model (SydsGPTv2) on ~12B tokens using a carefully engineered pipeline on a single NVIDIA 3080 Ti. In this final part of the series, I shift from pure pretraining to supervised instruction fine-tuning (SFT).

The goals for this phase were:
  • Start from the pretrained SydsGPTv2 checkpoint.
  • Fine-tune it on the Alpaca GPT-4 instruction dataset.
  • Evaluate its instruction-following behavior.
  • Deploy it via Hugging Face Hub + Gradio as a live demo.
  • Reflect honestly on what a 164M parameter model can and cannot do in this regime.

I’ll walk through the full notebook: data download and formatting, dataset and dataloaders, collate function, model/optimizer setup, training loops, loss visualization, evaluation, and generation — including all code.

1. Setup: Imports and High-Level Workflow

This notebook is focused on fine-tuning, not pretraining. At a high level, the workflow is:

  • Load and preprocess Alpaca.
  • Tokenize and batch data for efficient training.
  • Load the pretrained SydsGPTv2 model.
  • Fine-tune with a warmup + cosine LR schedule and checkpointing.
  • Visualize training/validation losses.
  • Evaluate and generate outputs.

Basic PyTorch imports:

import torch
import torch.nn as nn

2. Downloading and Loading the Alpaca Dataset

I started by defining a small utility to download the dataset from a URL if it’s not already cached locally.

import json
import urllib.request
import os

def download_data(url, path):
    if not os.path.exists(path):
        with urllib.request.urlopen(url) as response:
            raw_data = response.read().decode('utf-8')
        with open(path, 'w', encoding = 'utf-8') as f:
            f.write(raw_data)
        print(f"Dataset downloaded and saved to {path}")
    else:
        print(f"Dataset already exists at {path}")
    
    with open(path, 'r') as f:
        data = json.load(f)
    return data

Then I downloaded and inspected the Alpaca GPT-4 data:

url = "https://raw.githubusercontent.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/refs/heads/main/data/alpaca_gpt4_data.json"
path = "data/alpaca_gpt4_data.json"
data = download_data(url, path)

print(f"Loaded {len(data)} records from the dataset.")
print("Sample record:", data[25])

3. Train/Test/Validation Split

I split the dataset into train, test, and validation:

  • 85% train
  • 10% test
  • 5% validation

training_data = data[:int(len(data)*0.85)]
test_data = data[int(len(data)*0.85):int(len(data)*0.95)]
validation_data = data[int(len(data)*0.95):]

print(f"Training records: {len(training_data)}")
print(f"Test records: {len(test_data)}")
print(f"Validation records: {len(validation_data)}")

4. Formatting Records into Prompts

Instruction tuning lives and dies on the prompt format. For this, I used simple <|user|> and <|assistant|> tags.

Formatting a single Alpaca record into a user prompt:

def format_record(record):
    return f"<|user|>\n{record['instruction']}" + (f"\n{record['input']}" if record['input'] else "")

Example of combining prompt and response:

prompt = format_record(data[50])
response = f"\n\n<|assistant|>\n{data[50]['output']}"

print(prompt + response)

This is the core pattern used for training and evaluation.

5. Instruction Dataset: Encoding for SFT

I created a custom InstructionDataset that encodes <|user|> ... <|assistant|> ... pairs using tiktoken:

from torch.utils.data import Dataset

class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.encoded_text = []
        for record in data:
            text = format_record(record) + f"\n\n<|assistant|>\n{record['output']}"
            self.encoded_text.append(tokenizer.encode(text))
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.encoded_text[idx]

6. Tokenizer Initialization and Padding Token

I reused the GPT-2 tokenizer from tiktoken and used its end-of-text token as padding:

import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")
padding_token = tokenizer.eot_token
print(f"Padding token: {padding_token}")

7. Custom Collate Function for Instruction Batches

The collate function handles:

  • Padding sequences in a batch to the maximum length.
  • Creating input and target sequences shifted by one.
  • Masking padding tokens in targets with an ignore_idx (e.g. -100) so they don’t contribute to the loss.
  • Truncating to a fixed context_size (2048).
  • Moving tensors to the correct device.

def instruction_collate_fn(batch, padding_token, ignore_idx, device, context_size):
    max_length = max(len(item) for item in batch)
    input_ids, target_ids = [], []

    for item in batch:
        padded_item = item + [padding_token] * (max_length - len(item))
        inputs = torch.tensor(padded_item)
        targets = torch.tensor(padded_item[1:] + [padding_token])
        mask = targets == padding_token
        idxs = torch.nonzero(mask).squeeze()
        if idxs.numel() > 1:
            targets[idxs[1:]] = ignore_idx
        
        if context_size is not None:
            inputs = inputs[:context_size]
            targets = targets[:context_size]
        
        input_ids.append(inputs)
        target_ids.append(targets)

        
    input_ids = torch.stack(input_ids).to(device)
    target_ids = torch.stack(target_ids).to(device)
    return input_ids, target_ids

I also sanity-checked the collate function with toy data:

test_inputs1 = [1,2,3,4,5,6,7,8]
test_inputs2 = [9,10,11]
test_inputs3 = [12,13,14,15]
batch = [test_inputs1, test_inputs2, test_inputs3]
padded_input_batch, padded_target_batch = instruction_collate_fn(batch, padding_token=padding_token, ignore_idx=-100, device='cpu', context_size=2048)
print(padded_input_batch)
print(padded_target_batch)
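The reason an ignore index of -100 works is that PyTorch's cross-entropy loss skips any target position equal to ignore_index, so padded positions contribute nothing to the loss or its gradients. A toy illustration for intuition only (the token IDs are arbitrary, and this snippet is not taken from modules.Loss):

import torch
import torch.nn.functional as F

# Four target positions; the last two are padding, masked with -100.
logits = torch.randn(4, 50257)                  # (num_positions, vocab_size)
targets = torch.tensor([464, 2159, -100, -100])

# cross_entropy averages only over the two non-masked positions.
loss = F.cross_entropy(logits, targets, ignore_index=-100)
print(loss)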

8. Device Selection

Standard GPU/CPU selection:

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

9. DataLoaders for Train/Val/Test

I created a partial of the collate function with fixed parameters:

from functools import partial
collate_fn = partial(instruction_collate_fn, padding_token=padding_token, ignore_idx=-100, device=device, context_size=2048)

Then set up dataloaders:

from torch.utils.data import DataLoader

num_workers = 0
batch_size = 2

train_dataset = InstructionDataset(training_data, tokenizer)
train_dataloader = DataLoader(train_dataset, batch_size = batch_size, collate_fn = collate_fn, num_workers = num_workers, shuffle = True, drop_last = True)

validation_dataset = InstructionDataset(validation_data, tokenizer)
validation_dataloader = DataLoader(validation_dataset, batch_size = batch_size, collate_fn = collate_fn, num_workers = num_workers, shuffle = False, drop_last = False)

test_dataset = InstructionDataset(test_data, tokenizer)
test_dataloader = DataLoader(test_dataset, batch_size = batch_size, collate_fn = collate_fn, num_workers = num_workers, shuffle = False, drop_last = False)

Quick inspection of shapes:

print("Train Loader:")
for i, (inputs, targets) in enumerate(train_dataloader):
    print(inputs.shape, targets.shape)
    if i == 4:
        break

10. Model Configuration and Initialization

I used the same configuration as in Part 7:

SYDSGPT_CONFIG_V2_164M = {
    "vocab_size" : 50257,
    "context_length" : 2048,
    "embedding_dim" : 768,
    "num_heads" : 12,
    "num_layers" : 12,
    "dropout" : 0.1,
    "qkv_bias" : False
}

Then loaded the pretrained model and optimizer:

from galore_torch import GaLoreAdamW
from model.SydsGPTv2 import SydsGPTv2

model = SydsGPTv2(SYDSGPT_CONFIG_V2_164M)
model.load_state_dict(torch.load("sydsgpt/sydsgpt_v2_164m_trained_model-11.8B.pth", map_location=device))
optimizer = GaLoreAdamW(model.parameters(), weight_decay=0.01)
model.to(device)

11. Learning Rate Schedule: Warmup + Cosine

For fine-tuning I used small learning rates:

initial_lr = 5e-6
peak_lr = 2e-5
min_lr = 0.1 * peak_lr

print('Initial LR:', initial_lr)
print('Peak LR:', peak_lr)
print('Min LR:', min_lr)

total_steps_per_epoch = len(train_dataloader)
print('Total training steps per epoch:', total_steps_per_epoch)

warmup_steps = int(total_steps_per_epoch * .03)
print('Warmup steps:', warmup_steps)

The actual schedule logic lives in train_model_v2 (from modules.Training), which applies warmup then cosine decay over the total training horizon.
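train_model_v2 isn't reproduced in this notebook, but the warmup-plus-cosine logic it applies follows a standard pattern. Here is a minimal sketch of that schedule; the function name and exact formula are illustrative, not the actual implementation in modules.Training:

import math

def warmup_cosine_lr(step, total_steps, warmup_steps, initial_lr, peak_lr, min_lr):
    """Linear warmup from initial_lr to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return initial_lr + (peak_lr - initial_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Inside the training loop, the current LR is written into the optimizer before each step:
# for group in optimizer.param_groups:
#     group['lr'] = warmup_cosine_lr(global_step, num_epochs * total_steps_per_epoch,
#                                    warmup_steps, initial_lr, peak_lr, min_lr)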

12. Fine-Tuning: First 2 Epochs

I kicked off the first stage of fine-tuning for 2 epochs:

from modules.Training import train_model_v2

num_epochs = 2
training_losses, validation_losses, total_tokens_processed, learning_rates = train_model_v2(
    model,
    train_dataloader,
    validation_dataloader,
    optimizer,
    device,
    num_epochs,
    evaluation_frequency = 2000,
    start_context = format_record(validation_data[0]),
    tokenizer = tokenizer,
    checkpoint_interval = 2000,
    total_steps_per_epoch = total_steps_per_epoch,
    warmup_steps = warmup_steps,
    initial_lr = initial_lr,
    peak_lr = peak_lr,
    min_lr = min_lr
)

I saved the model after 2 epochs:

torch.save(model.state_dict(), "sydsgpt/sydsgpt_v2_164m_finetuned_alpaca_2epochs.pth")

13. Extended Fine-Tuning: Up to 6 and 10 Epochs

After continuing fine-tuning to a total of 6 epochs, I saved a checkpoint:

torch.save(model.state_dict(), "sydsgpt/sydsgpt_v2_164m_finetuned_alpaca_6epochs.pth")

Then I reinitialized the optimizer and trained for 4 more epochs (total 10):

from galore_torch import GaLoreAdamW
optimizer = GaLoreAdamW(model.parameters(), weight_decay=0.01)
from modules.Training import train_model_v2

num_epochs = 4
training_losses, validation_losses, total_tokens_processed, learning_rates = train_model_v2(
    model,
    train_dataloader,
    validation_dataloader,
    optimizer,
    device,
    num_epochs,
    evaluation_frequency = 2000,
    start_context = format_record(validation_data[0]),
    tokenizer = tokenizer,
    checkpoint_interval = 2000,
    total_steps_per_epoch = total_steps_per_epoch,
    warmup_steps = warmup_steps,
    initial_lr = initial_lr,
    peak_lr = peak_lr,
    min_lr = min_lr
)

And saved the 10-epoch checkpoint:

torch.save(model.state_dict(), "sydsgpt/sydsgpt_v2_164m_finetuned_alpaca_10epochs.pth")

14. Visualizing Training and Validation Loss

I used the same plotting pattern twice (after 6 and after 10 epochs) to visualize loss:

import numpy as np
from matplotlib import pyplot as plt

# Loss series from fine-tuning (concatenate across runs here if you resumed training)
train_loss = training_losses
val_loss = validation_losses

# Build step indices for raw series
steps = np.arange(1, len(train_loss) + 1)

# Simple moving average smoothing
def smooth_series(y, window=101):
    if len(y) < 3:
        return np.array(y)
    # Choose an odd window <= len(y)
    w = min(window, max(3, (len(y) // 50) * 2 + 1))
    if w % 2 == 0:
        w += 1
    kernel = np.ones(w) / w
    return np.convolve(y, kernel, mode='same')

# Average training loss every 2000 steps
bin_size = 2000
num_bins = int(np.ceil(len(train_loss) / bin_size))
train_bins = [
    np.mean(train_loss[i * bin_size : (i + 1) * bin_size])
    for i in range(num_bins)
]
# Use bin midpoints for x-axis
bin_steps = np.array([
    int(min(((i * bin_size) + min(len(train_loss), (i + 1) * bin_size)) // 2, len(train_loss)))
    for i in range(num_bins)
])

# Smooth the binned training loss for nicer curves
train_binned_smooth = smooth_series(np.array(train_bins), window=min(21, len(train_bins) if len(train_bins) > 0 else 21))

# Smooth validation loss (keep at per-step resolution)
val_smooth = smooth_series(np.array(val_loss))

# Create side-by-side subplots: training (binned + smoothed) and validation
fig, axs = plt.subplots(1, 2, figsize=(14, 6), sharex=False)

# Left: Training loss averages (smoothed)
ax1 = axs[0]
ax1.plot(bin_steps, train_binned_smooth, label='Training Loss (avg per 2000 steps, smoothed)', color='tab:blue')
ax1.set_xlabel('Training Steps')
ax1.set_ylabel('Loss')
ax1.set_title('Training Loss (Averaged & Smoothed)')
ax1.legend(loc='upper right')

# Right: Validation loss (smoothed) on its own
val_steps = np.linspace(1, len(train_loss), len(val_smooth), dtype=int)
ax2 = axs[1]
ax2.plot(val_steps, val_smooth, label='Validation Loss (smoothed)', color='tab:orange')
ax2.set_xlabel('Training Steps')
ax2.set_ylabel('Loss')
ax2.set_title('Validation Loss (Smoothed)')
ax2.legend(loc='upper right')

plt.tight_layout()
plt.show()

15. Final Validation Loss Evaluation

I computed the final validation loss over all batches after 6 and 10 epochs using calc_loader_loss:

from modules.Loss import calc_loader_loss
final_validation_loss = calc_loader_loss(validation_dataloader, model, device, num_batches=len(validation_dataloader))
print(f"Final Validation Loss after 6 epochs (All batches): {final_validation_loss:.4f}")

And again after 10 epochs:

from modules.Loss import calc_loader_loss
final_validation_loss = calc_loader_loss(validation_dataloader, model, device, num_batches=len(validation_dataloader))
print(f"Final Validation Loss after 10 epochs (All batches): {final_validation_loss:.4f}")

The validation loss was lower after 6 epochs than after 10, so I decided to proceed with the 6-epoch model for the next steps.
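For reference, calc_loader_loss lives in modules.Loss and isn't shown in this notebook. A minimal sketch of what such a helper might look like, assuming it averages cross-entropy over the requested number of batches with the same -100 ignore index used by the collate function (the name calc_loader_loss_sketch is mine, not the project's):

import torch
import torch.nn.functional as F

@torch.no_grad()
def calc_loader_loss_sketch(dataloader, model, device, num_batches=None):
    """Average cross-entropy loss over up to num_batches batches (illustrative only)."""
    model.eval()
    total_loss, batches_seen = 0.0, 0
    for i, (inputs, targets) in enumerate(dataloader):
        if num_batches is not None and i >= num_batches:
            break
        inputs, targets = inputs.to(device), targets.to(device)
        logits = model(inputs)                               # (B, T, vocab_size)
        loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten(), ignore_index=-100)
        total_loss += loss.item()
        batches_seen += 1
    model.train()
    return total_loss / max(1, batches_seen)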

16. Loading the Fine-Tuned Model for Evaluation

To evaluate, I reloaded the fine-tuned model:

from model.SydsGPTv2 import SydsGPTv2

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

model = SydsGPTv2(SYDSGPT_CONFIG_V2_164M)
model.load_state_dict(torch.load("sydsgpt/sydsgpt_v2_164m_finetuned_alpaca_6epochs.pth", map_location=device))
model.to(device)
model.eval()  # disable dropout for evaluation and generation

17. Generating and Evaluating Responses on Test Data

I generated responses on the first five test examples:

from modules.Generate import text_to_tokens, tokens_to_text, generate
for record in test_data[:5]:
    input_text = format_record(record)
    print(f"Input Text:\n{input_text.replace("<|user|>", "")}")
    input_tokens = text_to_tokens(input_text, tokenizer).to(device)
    output_tokens = generate(
        model,
        input_tokens,
        max_new_tokens = 200,
        context_size = 2048,
        temperature = 0.7,
        top_k = 40,
        eos_id = tokenizer.eot_token
    )
    output_text = tokens_to_text(output_tokens, tokenizer)
    response_text = output_text[len(input_text):].replace("<|assistant|>", "").strip()
    print(f"Model Response:\n{response_text}")
    print(f"Correct Response:\n{record['output']}")

This qualitative comparison was key for understanding how well the model was actually following instructions, beyond loss curves.
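The generate helper from modules.Generate also isn't reproduced here. For readers following along, this is a minimal sketch of temperature plus top-k sampling with an end-of-sequence stop, consistent with the parameters used above but not the project's actual implementation (it assumes a batch size of 1, as in these calls):

import torch

@torch.no_grad()
def generate_sketch(model, input_tokens, max_new_tokens, context_size,
                    temperature=1.0, top_k=None, eos_id=None):
    """Autoregressive sampling with temperature scaling and top-k filtering (sketch)."""
    tokens = input_tokens
    for _ in range(max_new_tokens):
        context = tokens[:, -context_size:]            # crop to the model's context window
        logits = model(context)[:, -1, :]              # logits for the last position
        if top_k is not None:
            # Mask out everything below the k-th largest logit.
            top_values, _ = torch.topk(logits, top_k)
            logits = logits.masked_fill(logits < top_values[:, [-1]], float('-inf'))
        if temperature > 0:
            probs = torch.softmax(logits / temperature, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
        else:
            next_token = torch.argmax(logits, dim=-1, keepdim=True)
        if eos_id is not None and next_token.item() == eos_id:
            break
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens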

18. Custom Prompt Generation

I also tried a custom instruction:

from modules.Generate import text_to_tokens, tokens_to_text, generate
model.eval()
input_text = """<|user|>
give me exactly 3 different sentences about the earth."""
print(f"Input Text:\n{input_text.replace("<|user|>", "")}")
input_tokens = text_to_tokens(input_text, tokenizer).to(device)
output_tokens = generate(
    model,
    input_tokens,
    max_new_tokens = 200,
    context_size = 2048,
    temperature = 0.7,
    top_k = 40,
    eos_id = tokenizer.eot_token
)
output_text = tokens_to_text(output_tokens, tokenizer)
response_text = output_text[len(input_text):].replace("<|assistant|>", "").strip()
print(f"Model Response:\n{response_text}")

This is the kind of prompt that reveals adherence to constraints (e.g., “exactly 3 sentences”) — something small models often struggle with.

19. Deployment: Hugging Face Model + Gradio Space

From here, I:

  • Uploaded sydsgpt_v2_164m_finetuned_alpaca_6epochs.pth (and later 10 epochs) to Hugging Face as a model repo.
  • Wrapped the model in a Gradio app hosted on Hugging Face Spaces to demo interactive instruction-following (a minimal sketch of such an app is shown below).
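The Space itself is a thin wrapper around the fine-tuned checkpoint. Here is a minimal sketch of such a Gradio app, reusing the model class, tokenizer, config, and generate helper from this notebook; the actual app on Spaces may differ in details:

import gradio as gr
import tiktoken
import torch

from model.SydsGPTv2 import SydsGPTv2
from modules.Generate import text_to_tokens, tokens_to_text, generate

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = tiktoken.get_encoding("gpt2")

# SYDSGPT_CONFIG_V2_164M as defined earlier in the notebook.
model = SydsGPTv2(SYDSGPT_CONFIG_V2_164M)
model.load_state_dict(torch.load("sydsgpt/sydsgpt_v2_164m_finetuned_alpaca_6epochs.pth",
                                 map_location=device))
model.to(device)
model.eval()

def respond(instruction):
    # Wrap the raw instruction in the same <|user|> / <|assistant|> format used for SFT.
    prompt = f"<|user|>\n{instruction}"
    input_tokens = text_to_tokens(prompt, tokenizer).to(device)
    output_tokens = generate(model, input_tokens, max_new_tokens=200, context_size=2048,
                             temperature=0.7, top_k=40, eos_id=tokenizer.eot_token)
    output_text = tokens_to_text(output_tokens, tokenizer)
    return output_text[len(prompt):].replace("<|assistant|>", "").strip()

demo = gr.Interface(fn=respond,
                    inputs=gr.Textbox(label="Instruction"),
                    outputs=gr.Textbox(label="Response"),
                    title="SydsGPTv2 164M Alpaca SFT Demo")
demo.launch()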

You can check out the running demo here: Sydsgpt V2 165M SFT Demo – a Hugging Face Space by siddsachar

The model can follow basic instructions, respond sensibly, and display some generalization. But it quickly showed its limits: shallow reasoning, inconsistent adherence to constraints, and occasional drift in longer responses.

20. Honest Reflection: 164M Parameters Are Not Enough

This is the crux of Part 8.

After multiple epochs of fine-tuning, careful LR scheduling, and qualitative evaluation via a live demo, the conclusion was clear:

A 164M parameter GPT-style model has insufficient capacity for robust, high-quality instruction following — at least with this approach and this dataset.

It’s a strong teaching model, a great vehicle for learning and documentation, and a very capable toy. But as an instruction follower, it falls short of what we’ve come to expect from modern assistants.

And that’s okay — that was part of the experiment.

21. Where I’m Going Next

This is the final part of this series, but not the end of the project.

Next steps will focus on:

  • Scaling capacity:
    Expanding the base model to ~500M parameters while reusing as much of the existing training pipeline as possible.
  • LoRA-based specialization:
    Adding LoRA adapters for:
    • Instruction following
    • Summarization
    • Q&A
    • And later, tool use and RAG
  • Multi-adapter design:
    Exploring how to route between different adapters and capabilities without retraining the entire base model.
  • Tool use and RAG:
    Giving the model structured access to tools and a retrieval layer so it can ground its outputs in external knowledge.

Try It Yourself

The full notebook with all the steps, from downloading and formatting the Alpaca data, building the dataloaders and collate function, fine-tuning, loss visualization, to evaluation and generation, is available here:

SydsGPT ALPACA SFT Repository

Clone the repo, open the Jupyter notebook, and step through the code.

Build It Yourself

If you want to try building it yourself, you can find the complete code with detailed explanations of each block in the source code section at the end of this post. All the best!

Closing the Series

This series was never just about “getting a model to work.” It was about:

  • Understanding pretraining end-to-end on modest hardware.
  • Building a data + model + training pipeline from scratch.
  • Documenting the process so others can reproduce and adapt it.
  • Testing the boundary of what a small, self-hostable model can do.

Instruction fine-tuning on Alpaca — and deploying the result — was the last major piece. It showed the limits of 164M, and in doing so, set the direction for what comes next.

The next chapter won’t be called “from scratch” anymore. It’ll be about scaling, modularity, and practical alignment.

Source Code

instruction-finetuning-nooutput

