A Recipe for Filtering Language Modelling Data
A deep dive into the methodology and engineering required to transform terabytes of messy, raw Common Crawl data into a high-quality dataset for training language models. This post details a step-by-step filtering recipe and a scalable pipeline architecture, then validates the final corpus by training a GPT-2 model on it.
August 18, 2025
•
20 min read