In Part 1 of this series, I built a simple neural network for binary and multiclass classification to get comfortable with the fundamentals of deep learning. For Part 2, I shifted focus to something equally important in the world of transformers: tokenization.

Transformers do not work directly with raw text. They need text to be broken down into smaller units called tokens, which are then mapped to numerical IDs. This process is handled by a tokenizer. Modern tokenizers like Byte Pair Encoding (BPE) or WordPiece are highly optimized, but I wanted to understand what happens under the hood. So I built a MiniTokenizer from scratch in Python.
This project is not meant to replace production-grade tokenizers like tiktoken or Hugging Face’s tokenizers. Instead, it is a learning tool that demonstrates the fundamentals of how text becomes numbers in an NLP pipeline.
Why Build a Tokenizer
Tokenization is the bridge between human-readable text and machine-readable numbers. Without it, models cannot process language. By building my own tokenizer, I learned:
- How to split text into meaningful units
- How to construct a vocabulary mapping tokens to integer IDs
- How encoding and decoding work in practice
- Why handling unknown tokens is essential
- How regex can be used for simple text processing
This exercise gave me a deeper appreciation for the complexity of modern tokenizers and prepared me for using BPE-based tokenizers in the actual GPT model.
How the MiniTokenizer Works
The MiniTokenizer is implemented in a Jupyter notebook and uses only Python’s standard library. Here are the main steps:
1. Corpus Assembly
All .txt files in a directory are concatenated into a single corpus. An <EOS> (end-of-sequence) token is added between documents.
from pathlib import Path

files = Path('./texts').glob('*.txt')
with open('all_text.txt', 'w', encoding='utf-8') as outfile:
    for file in files:
        outfile.write(Path(file).read_text(encoding='utf-8') + '<EOS>')
2. Tokenization
The text is split using a regular expression that separates punctuation, whitespace, and special characters into their own tokens.
tokenized_text = re.split(r'([,.!?():;_\'"]|--|\s)', raw_text)
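To see what this split produces, here is a quick check I added (it is not part of the original notebook). Because the pattern is wrapped in a capturing group, re.split keeps the delimiters as tokens and also yields empty strings between adjacent delimiters, which is why the encoded output further down contains IDs for '' and ' ' between the words.

import re

sample = "Hello, world!"
print(re.split(r'([,.!?():;_\'"]|--|\s)', sample))
# ['Hello', ',', '', ' ', 'world', '!', '']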
3. Vocabulary Construction
A vocabulary is built by mapping each unique token to an integer index. An <UNK> token is added to handle unknown tokens.
all_tokens = sorted(set(tokenized_text))
vocab = {token: index for index, token in enumerate(all_tokens)}
vocab.update({'<UNK>': len(vocab)})
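A quick sanity check on the mapping (my addition, not part of the original code) is to peek at the first few entries and confirm that <UNK> received the last index:

# The first few (token, ID) pairs; the empty string and whitespace tokens sort to the front.
print(list(vocab.items())[:5])

# <UNK> was appended after enumeration, so it should hold the highest index.
assert vocab['<UNK>'] == len(vocab) - 1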
4. Encoding and Decoding
The MiniTokenizer class provides methods to encode text into token IDs and decode token IDs back into text.
class MiniTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab
        self.inverse_vocab = {index: token for token, index in vocab.items()}

    def encode(self, text):
        tokens = re.split(r'([,.!?():;_\'"]|--|\s)', text)
        return [self.vocab.get(token, self.vocab['<UNK>']) for token in tokens]

    def decode(self, token_ids):
        return ''.join([self.inverse_vocab[token_id] for token_id in token_ids])
Example Usage
# Initialize the tokenizer
tokenizer = MiniTokenizer(vocab)
# Encode text
text = "Hello, world! This is a test of how well the tokenizer works."
token_ids = tokenizer.encode(text)
print(token_ids)
# Decode back to text
decoded_text = tokenizer.decode(token_ids)
print(decoded_text)
Output:
[7357, 10, 0, 3, 7182, 4, 0, 3, 1335, 3, 4267, 3, 1473, 3, 6587, 3, 4955, 3, 3929, 3, 7092, 3, 6602, 3, 7357, 3, 7181, 12, 0]
<UNK>, world! This is a test of how well the <UNK> works.
What I Learned
- Tokenization is not trivial. Even a simple regex-based tokenizer requires careful handling of punctuation, whitespace, and unknown tokens.
- Vocabulary construction is critical. The way you build and prune your vocabulary directly affects model performance.
- Encoding and decoding must be consistent. If the mapping is not reversible, the model cannot reliably generate text.
- Modern tokenizers like BPE solve many of the limitations of word-level tokenization by breaking words into subwords, which improves handling of rare and unseen words.
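For comparison (a minimal sketch using the tiktoken package, not part of the MiniTokenizer code), a BPE tokenizer splits words it has never seen into subword pieces instead of collapsing them to <UNK>:

# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")   # GPT-2's BPE vocabulary
ids = enc.encode("Hello, world! This is a test of how well the tokenizer works.")
print(ids)                # subword token IDs
print(enc.decode(ids))    # lossless round-trip, no <UNK> needed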
Next Steps
In the actual GPT model, I will use a BPE-based tokenizer such as tiktoken. BPE tokenizers are more efficient and better suited for large-scale language models. However, building this MiniTokenizer gave me the intuition I need to understand how those more advanced tokenizers work.
Try It Yourself
The code is available on GitHub. You can clone the repository, add your own text files, and experiment with encoding and decoding.
👉 MiniTokenizer on GitHub
Build It Yourself
If you want to try building it yourself, you can find the complete code with detailed explanations of each block in the source code section at the end of this post. All the best!
Closing Thoughts
This project was a valuable step in my GPT-from-scratch journey. By building a tokenizer, I now understand how raw text is transformed into the numerical sequences that power deep learning models. In the next part of the series, I will begin exploring the transformer architecture itself, starting with attention mechanisms.
Stay tuned for Part 3.
Source Code
from pathlib import Path
import re
📚 Corpus Assembly: Merging Text Files for Tokenization
This step combines multiple source documents into a single text file, preparing a unified corpus for downstream tokenization and analysis.
Inputs
- Directory: texts/ containing .txt files (UTF-8 encoded).
- Pattern: All files matching *.txt are included; subfolders are ignored.
Process Overview
- Enumerate Files: Use Path('./texts').glob('*.txt') to list all text files in the directory.
- Read & Concatenate: For each file, read its contents as UTF-8 and append to a single output file.
- Separation: Add an end-of-sequence token (<EOS>) after each file to ensure clear separation between documents.
- Output: Write the combined result to all_text.txt in the project root.
Why This Step?
- Consistency: Ensures all data is in one place for easier processing and reproducibility.
- Efficiency: Downstream scripts (tokenizers, analyzers) can operate on a single file, simplifying I/O.
- Flexibility: Easy to add or remove source files by updating the texts/ folder.
Validation & Tips
- Check the size of all_text.txt to confirm all data was written.
- Open the file and inspect the start/end of each document for encoding or separator issues.
- For reproducible order, use sorted(Path('./texts').glob('*.txt')).
- To add metadata, consider writing the filename or a header before each document (a sketch of both ideas follows this list).
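As a sketch of those last two tips, the assembly loop could be written as below. This is an alternative to the notebook's actual cell further down, and the header format is only an illustration:

from pathlib import Path

with open('all_text.txt', 'w', encoding='utf-8') as outfile:
    # sorted() makes the document order deterministic across runs.
    for file in sorted(Path('./texts').glob('*.txt')):
        outfile.write(f'### {file.name} ###\n')  # optional metadata header
        outfile.write(file.read_text(encoding='utf-8'))
        outfile.write('<EOS>')  # document separator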
Next Steps
- Use all_text.txt as input for tokenizer training, vocabulary building, or text analysis.
- Optionally, preprocess the text (e.g., normalization, lowercasing) before tokenization if required by your workflow; a possible preprocessing pass is sketched below.
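The notebook does not normalize the corpus, but if your workflow calls for it, a minimal preprocessing pass might look like this (Unicode normalization and lowercasing are shown purely as examples):

import unicodedata

def normalize(text):
    # Unify Unicode forms (composed vs. decomposed accents) and lowercase.
    text = unicodedata.normalize('NFC', text)
    return text.lower()

# e.g. raw_text = normalize(raw_text) after loading the corpus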
files = Path('./texts').glob('*.txt')
with open('all_text.txt', 'w', encoding='utf-8') as outfile:
    for file in files:
        outfile.write(Path(file).read_text(encoding='utf-8') + '<EOS>')
📏 Corpus Character Count: Quick Data Integrity Check
After merging your text files into all_text.txt, it’s important to verify that the corpus was assembled correctly and contains the expected amount of data. This cell performs a simple but effective validation by reading the entire file and reporting the total number of characters.
Purpose
- Sanity Check: Confirms that all_text.txt is not empty and that the concatenation process worked as intended.
- Data Integrity: Helps detect issues such as missing files, encoding errors, or incomplete writes.
- Baseline Metric: Provides a reference point for future preprocessing or modifications.
What This Cell Does
- Opens all_text.txt in read mode with UTF-8 encoding.
- Reads the entire file content into a string variable (raw_text).
- Prints the total number of characters in the corpus.
How to Use the Output
- Expected Value: The character count should be large and nonzero. If it’s unexpectedly small, check your texts/ directory for missing or empty files.
- Troubleshooting: If you encounter encoding errors, ensure all input files are UTF-8 encoded.
- Scaling: For very large corpora, consider reading the file in chunks or using the file size in bytes (os.stat('all_text.txt').st_size) instead; a quick sketch follows this list.
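Here is a small sketch of that scaling tip (my addition, not a cell from the notebook): check the byte size without loading the file, or count characters chunk by chunk.

import os

# Byte size on disk, no need to read the file into memory.
print(f"all_text.txt is {os.stat('all_text.txt').st_size} bytes")

# Character count in 1 MB chunks instead of one big read().
char_count = 0
with open('all_text.txt', 'r', encoding='utf-8') as f:
    for chunk in iter(lambda: f.read(1 << 20), ''):
        char_count += len(chunk)
print(f"Total characters: {char_count}")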
Next Steps
- Use raw_text as input for tokenization, vocabulary extraction, or further text analysis.
- Optionally, perform additional validation (e.g., line count, previewing the start/end of the file) to further ensure data quality.
with open('all_text.txt', 'r', encoding='utf-8') as input_text:
    raw_text = input_text.read()

print(f"Total number of characters in the raw text: {len(raw_text)}")
Total number of characters in the raw text: 330569
🧮 Tokenization and Vocabulary Construction
This cell performs the core tokenization step and builds a vocabulary from your assembled corpus. It splits the raw text into tokens, counts them, and constructs a mapping from each unique token to a unique integer index.
What This Cell Does
Tokenization:
- Uses a regular expression with re.split to break the text into tokens.
- The pattern splits on common punctuation marks ([,.!?():;_'"]), double dashes (--), and whitespace (\s).
- This approach preserves punctuation as separate tokens and ensures that words, punctuation, and spaces are all represented.
Token Count:
- Prints the total number of tokens generated from the corpus.
- Useful for understanding the granularity and size of your tokenized data.
Unique Tokens:
- Converts the token list to a set to extract all unique tokens.
- Sorts them for reproducibility and prints the total count.
- This gives you the vocabulary size, a key metric for language modeling and analysis.
Vocabulary Mapping:
- Creates a dictionary (vocab) mapping each unique token to a unique integer index.
- This mapping is essential for converting text into numerical form for machine learning models.
- Adds a special <UNK> token to the end of the vocabulary to handle unknown tokens.
Tips & Customization
- Regex Tuning: Adjust the regular expression to better fit your language or domain (e.g., handle contractions, special symbols, or multi-word expressions).
- Whitespace Handling: The current pattern includes whitespace as tokens. If you want to ignore or merge whitespace, modify the regex accordingly.
- Vocabulary Filtering: For large corpora, consider filtering out rare tokens or applying additional normalization (e.g., lowercasing, stemming). A sketch covering the last two points follows this list.
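As a sketch of those last two tips (my addition, not part of the notebook), here is a self-contained helper that can drop whitespace tokens and fold rare tokens into <UNK>; the min_count threshold is an arbitrary example:

from collections import Counter

def build_filtered_vocab(tokens, min_count=2, keep_whitespace=True):
    # Optionally drop pure-whitespace and empty tokens produced by the capturing split.
    if not keep_whitespace:
        tokens = [tok for tok in tokens if tok.strip()]
    # Keep only tokens seen at least min_count times; everything else maps to <UNK>.
    counts = Counter(tokens)
    kept = sorted(tok for tok, count in counts.items() if count >= min_count)
    vocab = {token: index for index, token in enumerate(kept)}
    vocab['<UNK>'] = len(vocab)
    return vocab

Note that if whitespace tokens are dropped, the decode step would have to re-insert separators (for example by joining with spaces).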
Next Steps
- Use the tokenized_text list for further processing, such as sequence modeling or n-gram analysis.
- The vocab dictionary can be saved and reused for encoding new text or training models.
- Analyze token frequency distributions or visualize the most common tokens for insights into your corpus (a short sketch appears after the code cell below).
tokenized_text = re.split(r'([,.!?():;_\'"]|--|\s)', raw_text)
print(f"Total number of tokens: {len(tokenized_text)}")
all_tokens = sorted(set(tokenized_text))
print(f"Total number of unique tokens: {len(all_tokens)}")
vocab = {token : index for index, token in enumerate(all_tokens)}
vocab.update({'<UNK>' : len(vocab)})
Total number of tokens: 143975
Total number of unique tokens: 7357
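As a quick follow-up to the Next Steps above (my addition, not a notebook cell), here is one way to inspect the most common tokens and persist the vocabulary; the vocab.json filename is arbitrary:

import json
from collections import Counter

# The ten most frequent tokens (whitespace and punctuation usually dominate).
print(Counter(tokenized_text).most_common(10))

# Save the vocabulary so it can be reloaded later for encoding new text.
with open('vocab.json', 'w', encoding='utf-8') as f:
    json.dump(vocab, f, ensure_ascii=False)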
🧩 MiniTokenizer Class: Encoding and Decoding Text
This cell defines the MiniTokenizer class, which provides simple methods to convert text into sequences of token IDs (encoding) and to reconstruct text from token IDs (decoding) using the vocabulary built in previous steps.
Class Overview
Initialization (__init__):
- Takes a vocab dictionary mapping tokens to unique integer indices.
- Builds an inverse_vocab dictionary for reverse lookup (index to token), enabling decoding.
Encoding (encode method):
- Splits input text into tokens using the same regular expression as before (re.split(r'([,.!?():;_\'"]|--|\s)', text)).
- Converts each token into its corresponding integer ID using the vocabulary.
- Substitutes the ID of the special <UNK> token whenever an unknown token is encountered.
- Returns a list of token IDs representing the input text.
Decoding (decode method):
- Converts a list of token IDs back into tokens using inverse_vocab.
- Joins the tokens into a single string; no extra separator is needed because whitespace tokens are preserved during encoding.
- Returns the reconstructed text.
Usage Example
tokenizer = MiniTokenizer(vocab)
ids = tokenizer.encode("Hello, world!")
print(ids)                    # Encoded token IDs
print(tokenizer.decode(ids))  # Decoded text (unknown words come back as <UNK>)
class MiniTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab
        self.inverse_vocab = {index: token for token, index in vocab.items()}

    def encode(self, text):
        tokens = re.split(r'([,.!?():;_\'"]|--|\s)', text)
        token_ids = [self.vocab[token] if token in self.vocab else self.vocab['<UNK>'] for token in tokens]
        return token_ids

    def decode(self, token_ids):
        text = ''.join([self.inverse_vocab[token_id] for token_id in token_ids])
        return text
tokenizer = MiniTokenizer(vocab)
ids = tokenizer.encode("Hello, world! This is a test of how well the tokenizer works.")
print(ids) # Encoded token IDs
print(tokenizer.decode(ids)) # Decoded text (unknown words come back as <UNK>)
[7357, 10, 0, 3, 7182, 4, 0, 3, 1335, 3, 4267, 3, 1473, 3, 6587, 3, 4955, 3, 3929, 3, 7092, 3, 6602, 3, 7357, 3, 7181, 12, 0]
<UNK>, world! This is a test of how well the <UNK> works.