In Part 1 of this series, I built a simple neural network for classification to get comfortable with the basics of deep learning. In Part 2, I created a MiniTokenizer to understand how raw text is transformed into tokens. Now, in Part 3, I am moving one step closer to building a GPT-style model by focusing on data preparation.

Training a large language model (LLM) is not just about designing the architecture. The quality and structure of the data pipeline determine how well the model can learn. This phase is where raw text is cleaned, tokenized, and organized into batches that a transformer can process efficiently. Without a solid data preparation workflow, even the best model design will fail to deliver.
Why Data Preparation Matters
Language models learn by predicting the next token in a sequence. To do this effectively, the training data must be:
- Unified: scattered text files need to be combined into a single corpus
- Tokenized: text must be converted into numerical IDs
- Structured: sequences must be sliced into manageable chunks
- Batched: data must be grouped for efficient GPU training
This process ensures that the model sees consistent, well-structured input and can learn patterns across a large body of text.
Building the Corpus
For this experiment, I created a test corpus of 20 books from Project Gutenberg. These books span philosophy, science, and literature, giving the model a diverse set of writing styles and vocabularies. Each book is stored as a .txt file in a books/ directory.
The first step is to merge them into a single file. I added a special token between documents to mark boundaries. This helps the model understand where one context ends and another begins.
from pathlib import Path

files = Path('./books').glob('*.txt')
with open('all_books.txt', 'w', encoding='utf-8') as outfile:
    for file in files:
        text = Path(file).read_text(encoding='utf-8')
        outfile.write(text + '<EOS>')
The result is a large text file containing all 20 books, ready for tokenization.
Tokenization with tiktoken
In Part 2, I built a MiniTokenizer to understand the basics. For this step, I switched to OpenAI’s tiktoken, which is optimized for speed and memory efficiency. It is the same tokenizer used in GPT models and supports subword tokenization through Byte Pair Encoding (BPE).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
with open("all_books.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

token_ids = enc.encode(raw_text)
print(f"Total tokens: {len(token_ids)}")
This converts the entire corpus into a sequence of integers. Each integer corresponds to a subword unit, which allows the model to handle rare words and character-level variations more effectively than word-level tokenization.
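To see the subword behavior in action, here is a small, illustrative check (the example word is arbitrary and not part of the original pipeline); it reuses the enc tokenizer created above:
# Illustrative only: a long, rare word is split into several subword pieces
# by the cl100k_base BPE vocabulary rather than mapped to an unknown token.
word = "antidisestablishmentarianism"
ids = enc.encode(word)
pieces = [enc.decode_single_token_bytes(t) for t in ids]
print(ids)     # a handful of token IDs for a single word
print(pieces)  # the byte-level subword pieces they correspond to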
Creating a PyTorch Dataset
Once tokenized, the data must be structured into sequences for training. I defined a custom PyTorch Dataset
that slices the tokenized corpus into overlapping windows. Each input sequence is paired with a target sequence shifted by one token, so the model learns to predict the next token.
import torch
from torch.utils.data import Dataset

class BooksDataset(Dataset):
    def __init__(self, token_ids, max_length=128, step_size=64):
        self.token_ids = token_ids
        self.max_length = max_length
        self.step_size = step_size

    def __len__(self):
        return (len(self.token_ids) - self.max_length) // self.step_size

    def __getitem__(self, idx):
        start = idx * self.step_size
        end = start + self.max_length
        x = torch.tensor(self.token_ids[start:end], dtype=torch.long)
        y = torch.tensor(self.token_ids[start+1:end+1], dtype=torch.long)
        return x, y
This design allows flexibility. By adjusting max_length and step_size, you can control how much context the model sees and how much overlap exists between sequences.
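As a rough, illustrative check (the token count of 4,962,540 is the one reported for this corpus in the Source Code section below), the snippet uses the same formula as the dataset's __len__ to show how these two knobs trade off window count against overlap:
# Sketch only: window counts for a few max_length / step_size combinations.
num_tokens = 4_962_540  # corpus size reported later in this post

for max_length, step_size in [(128, 64), (128, 128), (256, 64)]:
    n_windows = (num_tokens - max_length) // step_size  # same formula as __len__
    overlap = max_length - step_size
    print(f"max_length={max_length}, step_size={step_size}: "
          f"{n_windows} windows, {overlap} tokens of overlap")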
Batching with DataLoader
To train efficiently on GPUs, data must be batched. PyTorch’s DataLoader handles batching and shuffling. Because every window produced by BooksDataset has the same fixed length, the default collate function can stack batches directly; a custom collate function is only needed to pad variable-length sequences (a sketch follows the example below).
from torch.utils.data import DataLoader

dataset = BooksDataset(token_ids, max_length=128, step_size=64)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in dataloader:
    x, y = batch
    print(x.shape, y.shape)
    break
This produces batches of shape (batch_size, sequence_length), which is exactly what a transformer expects.
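If you do end up with variable-length windows (for example, by keeping the final partial chunk), the default collation cannot stack them. Here is a minimal sketch of a padding collate function, assuming a pad token ID of 0 purely for illustration (cl100k_base has no dedicated padding token):
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # hypothetical pad ID, for illustration only

def pad_collate(batch):
    # batch is a list of (input, target) tensor pairs, possibly of different lengths
    inputs, targets = zip(*batch)
    x = pad_sequence(inputs, batch_first=True, padding_value=PAD_ID)
    y = pad_sequence(targets, batch_first=True, padding_value=PAD_ID)
    return x, y

# Usage: DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=pad_collate)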
Adding Embeddings
Transformers cannot work directly with token IDs. They need embeddings that map each token ID to a dense vector. I also included positional embeddings so the model can understand the order of tokens.
import torch.nn as nn
vocab_size = enc.n_vocab
embed_dim = 256
max_length = 128
token_embedding = nn.Embedding(vocab_size, embed_dim)
pos_embedding = nn.Embedding(max_length, embed_dim)
These embeddings will be combined in the transformer model to provide both semantic and positional context.
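As a preview of how that combination looks, here is a minimal sketch that reuses the token_embedding and pos_embedding layers defined above (the real forward pass will live inside the transformer built in Part 4):
def embed(x):
    # x: (batch_size, seq_len) tensor of token IDs
    seq_len = x.shape[1]
    tok = token_embedding(x)                    # (batch_size, seq_len, embed_dim)
    pos = pos_embedding(torch.arange(seq_len))  # (seq_len, embed_dim)
    return tok + pos                            # broadcast over the batch dimension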
Lessons Learned
- Corpus design matters: Choosing diverse texts helps the model generalize better.
- Special tokens are essential: <EOS> markers give the model clear boundaries.
- Tokenization is powerful: Subword tokenization handles rare words more gracefully than word-level approaches.
- Batching is critical: Efficient batching ensures training runs smoothly on GPUs.
- Inspect everything: Checking token counts, sequence shapes, and batch outputs before training prevents costly mistakes later.
Try It Yourself
You can run this workflow with your own text data. The repository is available on GitHub:
👉 Data Preparation Repository
Clone the repo, add your .txt files to the books/ directory, and run the notebook. You can experiment with different sequence lengths, batch sizes, and tokenizers.
Build It Yourself
If you want to try building it yourself, you can find the complete code with detailed explanations of each block in the source code section at the end of this post. All the best!
What’s Next
With the data pipeline in place, I am ready to move on to the transformer architecture itself. In Part 4, I will implement the building blocks of GPT: attention mechanisms, transformer layers, and causal masking. This is where the model starts to come alive.
Stay tuned for Part 4.
Source Code
import tiktoken
from importlib.metadata import version
import torch
from torch.utils.data import Dataset, DataLoader
from pathlib import Path
Tokenization with tiktoken
This cell demonstrates how to use the tiktoken library for tokenizing and detokenizing text, which is essential for working with language models such as those from OpenAI.
Steps Covered
- Install and Import tiktoken: Ensure the tiktoken library is installed and import it along with the version utility.
- Check tiktoken Version: Print the installed version of tiktoken to verify the setup.
- Initialize Tokenizer: Load the cl100k_base encoding, which is commonly used for OpenAI models.
- Explore Vocabulary Size: Display the size of the tokenizer’s vocabulary.
- Tokenize Sample Text: Encode a sample string into tokens and display the result.
- Decode Tokens: Convert the tokens back to text and verify the output.
Note: Tokenization is the process of converting text into a sequence of tokens (numbers) that a model can understand. Detokenization is the reverse process, converting tokens back into human-readable text.
print(f"Tiktoken version: {version('tiktoken')}")
tokenizer = tiktoken.get_encoding("cl100k_base")
print(f'Vocabulary size: {tokenizer.n_vocab}')
sample_text = "Hello, world! This is a test of how well the tokenizer works."
tokens = tokenizer.encode(sample_text)
print(f'Encoded tokens: {tokens}')
decoded = tokenizer.decode(tokens)
print(f'Decoded text: {decoded}')
Tiktoken version: 0.11.0
Vocabulary size: 100277
Encoded tokens: [9906, 11, 1917, 0, 1115, 374, 264, 1296, 315, 1268, 1664, 279, 47058, 4375, 13]
Decoded text: Hello, world! This is a test of how well the tokenizer works.
Combine Text Files for LLM Training
This cell merges all .txt files from the books directory into a single file, all_books.txt, with each book separated by an <EOS> (End Of Sequence) token. This is a common preprocessing step for language model training, allowing the model to learn document boundaries.
What the Code Does
- Finds all .txt files in the books directory using Path.glob.
- Opens all_books.txt for writing in UTF-8 encoding.
- Iterates through each file, reads its content, and writes it to the output file.
- Appends <EOS> after each book to mark the end of a document.
Why use <EOS>? The <EOS> token helps the language model distinguish where one document ends and another begins. This is important for tasks like text generation, where you want the model to respect document boundaries.
files = Path('./books').glob('*.txt')
with open('all_books.txt', 'w', encoding='utf-8') as outfile:
    for file in files:
        outfile.write(Path(file).read_text(encoding='utf-8') + '<EOS>')
Tokenize the Combined Text Corpus
This cell reads the entire combined text file (all_books.txt) and tokenizes it using the tiktoken tokenizer. Tokenization is a crucial preprocessing step for training language models, as it converts raw text into a sequence of tokens (integers) that the model can understand.
What the Code Does
- Reads the full corpus: Loads all text from all_books.txt into memory.
- Tokenizes the text: Uses the tokenizer.encode() method to convert the text into a list of tokens.
- Prints the token count: Displays the total number of tokens, which is useful for estimating dataset size and batching during training.
Tip: Knowing the total number of tokens helps you plan batch sizes, training steps, and memory requirements for your LLM training pipeline.
with open('all_books.txt', 'r', encoding='utf-8') as textfile:
    all_text = textfile.read()

encoded_text = tokenizer.encode(all_text)
print(f'Total number of tokens: {len(encoded_text)}')
Total number of tokens: 4962540
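To make the Tip above concrete, here is a rough, illustrative estimate of training steps per epoch from this token count; the max_length, step_size, and batch_size values are the create_dataloader defaults defined below, not requirements:
total_tokens = len(encoded_text)  # 4,962,540 for this corpus
max_length, step_size, batch_size = 512, 256, 8

# Same windowing as BooksDataset below: one window every step_size tokens.
num_windows = len(range(0, total_tokens - max_length, step_size))
steps_per_epoch = num_windows // batch_size  # drop_last=True in the DataLoader
print(f'{num_windows} windows -> about {steps_per_epoch} steps per epoch')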
Creating a PyTorch Dataset for Language Model Training
This cell defines a custom BooksDataset class, which prepares tokenized text data for training an autoregressive language model (such as GPT). The dataset generates input and target sequences for each training example, enabling the model to learn to predict the next token in a sequence.
What the Code Does
- Initializes the Dataset: Takes the full text, a tokenizer, a maximum sequence length (max_length), and a step size (step_size, the stride between consecutive sequences).
- Tokenizes the Text: Encodes the entire text into tokens using the provided tokenizer.
- Creates Overlapping Chunks: For each starting position in the tokenized text, creates an input chunk of length max_length and a target chunk that is shifted by one token (the next-token prediction target).
- Stores as Tensors: Converts each chunk into a PyTorch tensor for efficient batching and training.
- Implements __len__ and __getitem__: Standard PyTorch Dataset methods for compatibility with DataLoader.
Tip:
- Adjust max_length and step_size to control the size and overlap of training samples.
- This dataset structure is ideal for next-token prediction tasks, where the model learns to generate text one token at a time.
class BooksDataset(Dataset):
    def __init__(self, text, tokenizer, max_length, step_size):
        self.input_ids = []
        self.target_ids = []
        encoded_text = tokenizer.encode(text)
        for i in range(0, len(encoded_text) - max_length, step_size):
            input_chunk = encoded_text[i:i + max_length]
            target_chunk = encoded_text[i + 1 : i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, index):
        return self.input_ids[index], self.target_ids[index]
Utility Function: Create DataLoader for LLM Training
This cell defines a helper function, create_dataloader, which streamlines the process of preparing a PyTorch DataLoader for language model training. It takes raw text and returns a DataLoader that yields batches of input and target token sequences, ready for model training.
What the Code Does
- Defines create_dataloader: Accepts the full text, maximum sequence length, step size (stride), and batch size as arguments.
- Initializes the Tokenizer: Uses the cl100k_base encoding from tiktoken for tokenization.
- Creates a BooksDataset: Uses the custom dataset class to generate input-target pairs from the tokenized text.
- Builds a DataLoader: Wraps the dataset in a PyTorch DataLoader for efficient batching, shuffling, and parallel data loading.
- Returns the DataLoader: Ready to be used in a training loop for next-token prediction tasks.
Tip:
- Adjust max_length, step_size, and batch_size to fit your model and hardware constraints.
- This function abstracts away the repetitive setup code, making your training pipeline cleaner and more modular.
def create_dataloader(text, max_length=512, step_size=256, batch_size=8, shuffle=True):
    tokenizer = tiktoken.get_encoding('cl100k_base')
    dataset = BooksDataset(text, tokenizer, max_length, step_size)
    dataloader = DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=True,
        num_workers=0
    )
    return dataloader
Example: Using the DataLoader for Batching
This cell demonstrates how to use the create_dataloader utility to generate batches of input and target sequences for language model training. It shows how to iterate through the DataLoader and inspect the structure of each batch.
What the Code Does
- Creates a DataLoader: Calls create_dataloader with the full text, a batch size of 2, a maximum sequence length of 8, and a step size of 4. Shuffling is disabled for demonstration purposes.
- Iterates Through Batches: Loops through the DataLoader and prints the first 4 batches, showing the input and target tensors for each batch.
- Batch Structure: Each batch contains a tuple of input and target tensors, where:
  - The input tensor is a sequence of token IDs of length max_length.
  - The target tensor is the same sequence shifted by one token (for next-token prediction).
Tip:
- Adjust batch_size, max_length, and step_size to match your model and hardware.
- Inspecting batches before training helps verify that your data pipeline is working as expected.
dataloader = create_dataloader(all_text, batch_size=2, max_length=8, step_size=4, shuffle=False)

for i, batch in enumerate(dataloader):
    print(f"Batch {i}: {batch}")
    if i == 3:
        break
Batch 0: [tensor([[ 3305, 791, 5907, 52686, 58610, 315, 578, 19121], [58610, 315, 578, 19121, 21785, 315, 12656, 42482]]), tensor([[ 791, 5907, 52686, 58610, 315, 578, 19121, 21785], [ 315, 578, 19121, 21785, 315, 12656, 42482, 7361]])]
Batch 1: [tensor([[21785, 315, 12656, 42482, 7361, 2028, 35097, 374], [ 7361, 2028, 35097, 374, 369, 279, 1005, 315]]), tensor([[ 315, 12656, 42482, 7361, 2028, 35097, 374, 369], [ 2028, 35097, 374, 369, 279, 1005, 315, 5606]])]
Batch 2: [tensor([[ 369, 279, 1005, 315, 5606, 12660, 304, 279], [ 5606, 12660, 304, 279, 3723, 4273, 323, 198]]), tensor([[ 279, 1005, 315, 5606, 12660, 304, 279, 3723], [12660, 304, 279, 3723, 4273, 323, 198, 3646]])]
Batch 3: [tensor([[3723, 4273, 323, 198, 3646, 1023, 5596, 315], [3646, 1023, 5596, 315, 279, 1917, 520, 912]]), tensor([[4273, 323, 198, 3646, 1023, 5596, 315, 279], [1023, 5596, 315, 279, 1917, 520, 912, 2853]])]
Token and Positional Embeddings for LLMs
This cell demonstrates how to combine token embeddings and positional embeddings, which are essential components in transformer-based language models (LLMs) like GPT. Token embeddings represent the meaning of each token, while positional embeddings encode the position of each token in the input sequence, allowing the model to capture word order and context.
What the Code Does
- Defines vocab_size and embedding_dim: Sets the vocabulary size and embedding dimension for the model.
- Sets context_length: Specifies the maximum sequence length (context window) the model can process.
- Creates Token Embedding Layer: Maps each token ID to a dense vector of size embedding_dim.
- Creates Positional Embedding Layer: Maps each position (from 0 to context_length - 1) to a dense vector of the same size.
- Loads a Batch of Data: Uses the DataLoader to get a batch of input and target token sequences.
- Computes Token Embeddings: Looks up embeddings for each token in the input batch.
- Computes Positional Embeddings: Looks up embeddings for each position in the sequence.
- Combines Embeddings: Adds token and positional embeddings to form the final input embeddings for the model.
- Prints Shapes: Displays the shapes of the resulting tensors to verify correctness.
Tip:
- Adding token and positional embeddings is a standard practice in transformer models, enabling them to understand both content and order of tokens.
- Ensure that context_length matches the maximum sequence length used during training and inference.
vocab_size = tokenizer.n_vocab
embedding_dim = 512
context_length = 1024
token_embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)
positional_embedding_layer = torch.nn.Embedding(context_length, embedding_dim)
dataloader = create_dataloader(all_text, batch_size=8, max_length=context_length, step_size=512, shuffle=True)
dataiter = iter(dataloader)
test_input, test_targets = next(dataiter)
token_embeddings = token_embedding_layer(test_input)
print(f'Token embeddings shape: {token_embeddings.shape}')
positional_embeddings = positional_embedding_layer(torch.arange(context_length))
print(f'Positional embeddings shape: {positional_embeddings.shape}')
input_embeddings = token_embeddings + positional_embeddings
print(f'Input embeddings shape: {input_embeddings.shape}')
Token embeddings shape: torch.Size([8, 1024, 512])
Positional embeddings shape: torch.Size([1024, 512])
Input embeddings shape: torch.Size([8, 1024, 512])