In Part 1 of this series, I built a simple neural network for classification to get comfortable with the basics of deep learning. In Part 2, I created a MiniTokenizer to understand how raw text is transformed into tokens. Now, in Part 3, I am moving one step closer to building a GPT-style model by focusing on data preparation.

Training a large language model (LLM) is not just about designing the architecture. The quality and structure of the data pipeline determine how well the model can learn. This phase is where raw text is cleaned, tokenized, and organized into batches that a transformer can process efficiently. Without a solid data preparation workflow, even the best model design will fail to deliver.

Why Data Preparation Matters

Language models learn by predicting the next token in a sequence. To do this effectively, the training data must be:

  • Unified: scattered text files need to be combined into a single corpus
  • Tokenized: text must be converted into numerical IDs
  • Structured: sequences must be sliced into manageable chunks
  • Batched: data must be grouped for efficient GPU training

This process ensures that the model sees consistent, well-structured input and can learn patterns across a large body of text.

Building the Corpus

For this experiment, I created a test corpus of 20 books from Project Gutenberg. These books span philosophy, science, and literature, giving the model a diverse set of writing styles and vocabularies. Each book is stored as a .txt file in a books/ directory.

The first step is to merge them into a single file. I added a special token between documents to mark boundaries. This helps the model understand where one context ends and another begins.

from pathlib import Path

# Collect every .txt file in the books/ directory
files = Path('./books').glob('*.txt')

# Merge all books into a single corpus file, writing a '<EOS>' marker after
# each book. The marker is plain text here; it simply flags document
# boundaries in the corpus.
with open('all_books.txt', 'w', encoding='utf-8') as outfile:
    for file in files:
        text = file.read_text(encoding='utf-8')
        outfile.write(text + '<EOS>')

The result is a large text file containing all 20 books, ready for tokenization.
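Before tokenizing, it is worth a quick sanity check that the merge produced what you expect. Here is a minimal sketch (assuming the all_books.txt file and the <EOS> marker from the script above) that prints the corpus size and the number of document boundaries:

from pathlib import Path

corpus = Path('all_books.txt').read_text(encoding='utf-8')

# One '<EOS>' marker was written after every book, so the marker count
# should match the number of .txt files that were merged.
print(f"Characters in corpus: {len(corpus):,}")
print(f"Documents found: {corpus.count('<EOS>')}")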

Tokenization with tiktoken

In Part 2, I built a MiniTokenizer to understand the basics. For this step, I switched to OpenAI’s tiktoken, which is optimized for speed and memory efficiency. It is the same tokenizer used in GPT models and supports subword tokenization through Byte Pair Encoding (BPE).

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
with open("all_books.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Encode the full corpus into a flat sequence of token IDs
token_ids = enc.encode(raw_text)
print(f"Total tokens: {len(token_ids)}")

This converts the entire corpus into a sequence of integers. Each integer corresponds to a subword unit, which allows the model to handle rare words and character-level variations more effectively than word-level tokenization.
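To see this in action, here is a small sketch (using the same cl100k_base encoding and an arbitrary uncommon word as an example) that encodes a word and decodes each token ID back into its subword piece:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# An uncommon word gets split into smaller, reusable subword pieces
word = "transmogrification"
ids = enc.encode(word)
pieces = [enc.decode([i]) for i in ids]
print(ids)     # a handful of token IDs
print(pieces)  # the subword strings those IDs map back to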

Creating a PyTorch Dataset

Once tokenized, the data must be structured into sequences for training. I defined a custom PyTorch Dataset that slices the tokenized corpus into overlapping windows. Each input sequence is paired with a target sequence shifted by one token, so the model learns to predict the next token.

import torch
from torch.utils.data import Dataset

class BooksDataset(Dataset):
    """Slices a tokenized corpus into overlapping fixed-length windows."""

    def __init__(self, token_ids, max_length=128, step_size=64):
        self.token_ids = token_ids
        self.max_length = max_length  # tokens per training sequence
        self.step_size = step_size    # stride between consecutive windows

    def __len__(self):
        # Number of full windows that fit in the corpus at the given stride
        return (len(self.token_ids) - self.max_length) // self.step_size

    def __getitem__(self, idx):
        start = idx * self.step_size
        end = start + self.max_length
        # Input window and its next-token targets, shifted right by one
        x = torch.tensor(self.token_ids[start:end], dtype=torch.long)
        y = torch.tensor(self.token_ids[start+1:end+1], dtype=torch.long)
        return x, y

This design allows flexibility. By adjusting max_length and step_size, you can control how much context the model sees and how much overlap exists between sequences.
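As a quick check that the windows line up, here is a minimal sketch (reusing token_ids and the BooksDataset defined above) that pulls one sample and confirms the targets are the inputs shifted by one token:

import torch

dataset = BooksDataset(token_ids, max_length=128, step_size=64)

x, y = dataset[0]
print(x.shape, y.shape)            # both torch.Size([128])
print(torch.equal(x[1:], y[:-1]))  # True: y is x shifted by one position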

Batching with DataLoader

To train efficiently on GPUs, data must be batched. PyTorch’s DataLoader handles batching and shuffling. A custom collate function can be added to pad sequences when their lengths vary; here, every window from BooksDataset already has the same fixed length, so the default collation stacks them into tensors directly.

from torch.utils.data import DataLoader

dataset = BooksDataset(token_ids, max_length=128, step_size=64)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in dataloader:
    x, y = batch
    print(x.shape, y.shape)
    break

This produces batches of shape (batch_size, sequence_length), which is exactly what a transformer expects.

Adding Embeddings

Transformers cannot work directly with token IDs. They need embeddings that map each token ID to a dense vector. I also included positional embeddings so the model can understand the order of tokens.

import torch.nn as nn

vocab_size = enc.n_vocab   # size of the cl100k_base vocabulary
embed_dim = 256            # dimensionality of each token vector
max_length = 128           # must match the sequence length used in the dataset

# Maps each token ID to a dense vector
token_embedding = nn.Embedding(vocab_size, embed_dim)
# Maps each position (0 .. max_length-1) to a dense vector
pos_embedding = nn.Embedding(max_length, embed_dim)

These embeddings will be combined in the transformer model to provide both semantic and positional context.
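As a preview of how these pieces fit together, here is a minimal sketch (reusing the dataloader and the embedding layers above) that sums token and positional embeddings for one batch, producing the input a transformer block would receive:

import torch

x, y = next(iter(dataloader))         # x: (batch_size, seq_len)
batch_size, seq_len = x.shape

tok_emb = token_embedding(x)          # (batch_size, seq_len, embed_dim)
positions = torch.arange(seq_len)     # (seq_len,)
pos_emb = pos_embedding(positions)    # (seq_len, embed_dim)

inputs = tok_emb + pos_emb            # positional vectors broadcast over the batch
print(inputs.shape)                   # e.g. torch.Size([32, 128, 256])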

Lessons Learned

  • Corpus design matters: Choosing diverse texts helps the model generalize better.
  • Special tokens are essential: Boundary markers such as <EOS> give the model clear signals about where one document ends and the next begins.
  • Tokenization is powerful: Subword tokenization handles rare words more gracefully than word-level approaches.
  • Batching is critical: Efficient batching ensures training runs smoothly on GPUs.
  • Inspect everything: Checking token counts, sequence shapes, and batch outputs before training prevents costly mistakes later.

Try It Yourself

You can run this workflow with your own text data. The repository is available on GitHub:

👉 Data Preparation Repository

Clone the repo, add your .txt files to the books/ directory, and run the notebook. You can experiment with different sequence lengths, batch sizes, and tokenizers.

Build It Yourself

If you want to try building it yourself, you can find the complete code with detailed explanations of each block in the source code section at the end of this post. All the best!

What’s Next

With the data pipeline in place, I am ready to move on to the transformer architecture itself. In Part 4, I will implement the building blocks of GPT: attention mechanisms, transformer layers, and causal masking. This is where the model starts to come alive.

Stay tuned for Part 4.

Source Code

DataPreparation
