In Part 1 of this series, I built a simple neural network for binary and multiclass classification to get comfortable with the fundamentals of deep learning. For Part 2, I shifted focus to something equally important in the world of transformers: tokenization.

Transformers do not work directly with raw text. They need text to be broken down into smaller units called tokens, which are then mapped to numerical IDs. This process is handled by a tokenizer. Modern tokenizers based on algorithms like Byte Pair Encoding (BPE) or WordPiece are highly optimized, but I wanted to understand what happens under the hood. So I built a MiniTokenizer from scratch in Python.

This project is not meant to replace production-grade tokenizers like tiktoken or Hugging Face’s tokenizers. Instead, it is a learning tool that demonstrates the fundamentals of how text becomes numbers in an NLP pipeline.

Why Build a Tokenizer

Tokenization is the bridge between human-readable text and machine-readable numbers. Without it, models cannot process language. By building my own tokenizer, I learned:

  • How to split text into meaningful units
  • How to construct a vocabulary mapping tokens to integer IDs
  • How encoding and decoding work in practice
  • Why handling unknown tokens is essential
  • How regex can be used for simple text processing

This exercise gave me a deeper appreciation for the complexity of modern tokenizers and prepared me for using BPE-based tokenizers in the actual GPT model.

How the MiniTokenizer Works

The MiniTokenizer is implemented in a Jupyter notebook and uses only Python’s standard library. Here are the main steps:

1. Corpus Assembly

All .txt files in a directory are concatenated into a single corpus. An <EOS> (end-of-sequence) token is added between documents.

from pathlib import Path

# Concatenate every .txt file in ./texts into a single corpus file,
# appending an <EOS> marker after each document
files = Path('./texts').glob('*.txt')
with open('all_text.txt', 'w', encoding='utf-8') as outfile:
    for file in files:
        outfile.write(file.read_text(encoding='utf-8') + '<EOS>')

2. Tokenization

The corpus is read back in and split using a regular expression that separates punctuation, whitespace, and special characters into their own tokens.

import re

raw_text = Path('all_text.txt').read_text(encoding='utf-8')
tokenized_text = re.split(r'([,.!?():;_\'"]|--|\s)', raw_text)
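
To make this concrete, here is what the same pattern produces on a short sample string. The split keeps whitespace as its own tokens and inserts empty strings between adjacent delimiters, and both end up in the vocabulary as ordinary entries.

sample = "Hello, world!"
print(re.split(r'([,.!?():;_\'"]|--|\s)', sample))
# ['Hello', ',', '', ' ', 'world', '!', '']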

3. Vocabulary Construction

A vocabulary is built by mapping each unique token to an integer index. An <UNK> token is added to handle unknown tokens.

all_tokens = sorted(set(tokenized_text))
vocab = {token: index for index, token in enumerate(all_tokens)}
vocab.update({'<UNK>': len(vocab)})
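
As a quick sanity check (a small sketch, assuming the vocab dictionary built above), you can confirm the vocabulary size and verify that <UNK> received the last index:

print(len(vocab))               # number of unique tokens plus <UNK>
print(vocab['<UNK>'])           # len(vocab) - 1, since <UNK> was appended last
print(list(vocab.items())[:5])  # a few sample (token, id) pairs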

4. Encoding and Decoding

The MiniTokenizer class provides methods to encode text into token IDs and decode token IDs back into text.

class MiniTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab
        # Reverse mapping from token IDs back to tokens, used by decode()
        self.inverse_vocab = {index: token for token, index in vocab.items()}

    def encode(self, text):
        # Split with the same regex used to build the vocabulary, then map
        # each token to its ID, falling back to <UNK> for unknown tokens
        tokens = re.split(r'([,.!?():;_\'"]|--|\s)', text)
        return [self.vocab.get(token, self.vocab['<UNK>']) for token in tokens]

    def decode(self, token_ids):
        # Whitespace is stored as its own token, so joining on '' restores the text
        return ''.join([self.inverse_vocab[token_id] for token_id in token_ids])

Example Usage

# Initialize the tokenizer
tokenizer = MiniTokenizer(vocab)

# Encode text
text = "Hello, world! This is a test of how well the tokenizer works."
token_ids = tokenizer.encode(text)
print(token_ids)

# Decode back to text
decoded_text = tokenizer.decode(token_ids)
print(decoded_text)

Output:

[7357, 10, 0, 3, 7182, 4, 0, 3, 1335, 3, 4267, 3, 1473, 3, 6587, 3, 4955, 3, 3929, 3, 7092, 3, 6602, 3, 7357, 3, 7181, 12, 0]

<UNK>, world! This is a test of how well the <UNK> works.

"Hello" and "tokenizer" do not appear in the training corpus, so the tokenizer falls back to the <UNK> ID for both and they decode back as <UNK>.

What I Learned

  • Tokenization is not trivial. Even a simple regex-based tokenizer requires careful handling of punctuation, whitespace, and unknown tokens.
  • Vocabulary construction is critical. The way you build and prune your vocabulary directly affects model performance.
  • Encoding and decoding must be consistent. If the mapping is not reversible, the model cannot reliably generate text.
  • Modern tokenizers like BPE solve many of the limitations of word-level tokenization by breaking words into subwords, which improves handling of rare and unseen words.

Next Steps

In the actual GPT model, I will use a BPE-based tokenizer such as tiktoken. BPE tokenizers are more efficient and better suited for large-scale language models. However, building this MiniTokenizer gave me the intuition I need to understand how those more advanced tokenizers work.
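
As a small preview (a sketch using tiktoken's public API, not part of the MiniTokenizer code), the GPT-2 encoding illustrates the subword behavior described above: a rare or made-up word that a small word-level vocabulary would likely map to <UNK> is instead broken into smaller, known pieces and can be reconstructed exactly.

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")

# A rare word is split into familiar subword pieces instead of becoming unknown
ids = enc.encode("Tokenization of supercalifragilistic words")
print(ids)                             # a short list of subword token IDs
print([enc.decode([i]) for i in ids])  # the individual subword strings
print(enc.decode(ids))                 # round-trips back to the original text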

Try It Yourself

The code is available on GitHub. You can clone the repository, add your own text files, and experiment with encoding and decoding.

👉 MiniTokenizer on GitHub

Build It Yourself

If you want to try building it yourself, you can find the complete code with detailed explanations of each block in the source code section at the end of this post. All the best!

Closing Thoughts

This project was a valuable step in my GPT-from-scratch journey. By building a tokenizer, I now understand how raw text is transformed into the numerical sequences that power deep learning models. In the next part of the series, I will begin exploring the transformer architecture itself, starting with attention mechanisms.

Stay tuned for Part 3.

Source Code

MiniTokenizer
