The Epsilon Trap: When Adam Stops Being Adam
Beyond numerical stability, we investigate an often overlooked hyperparameter in the Adam optimizer: epsilon.
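For context, here is a minimal sketch of the standard Adam update showing exactly where epsilon enters the denominator (variable names and defaults are illustrative, not taken from the post):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-indexed step count.
    Note eps sits inside the denominator: as eps grows, the adaptive
    rescaling fades and the update drifts toward momentum SGD."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```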
Testing whether semantic relatedness of instructions affects a model's ability to follow them under cognitive load
We strip down mechanistic interpretability to three key experiments: watching a model 'think', finding where it stores concepts, and performing 'causal surgery' to change its 'thought process'
A visual guide and toy experiment to build intuition for the practical differences between Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
An analysis of information hotspots in embeddings.
Flash Attention played a major role in making LLMs more accessible to consumers. This algorithm embodies how a set of what one might consider "trivial ideas" can come together and form a powerful s...
You've probably heard of Transformers by now; they're everywhere, so much so that newborn babies are gonna start saying Transformers as their first word. This blog will explore an important co...

If you're familiar with the Attention Mechanism, then you know that before applying a softmax to the attention scores, we need to rescale them by a factor of $\frac{1}{\sqrt{D_k}}$ where $D_k$ is t...
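As a quick illustration of that rescaling step, a minimal NumPy sketch of scaled dot-product attention might look like this (names and shapes are illustrative, not taken from the post):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(D_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # rescale before softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V
```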

In this blog we will shed light on a crucial component of the Transformer architecture that hasn't been given the attention it deserves, and you'll also get to see some pretty visualizations!
A comprehensive guide to chunking strategies for Retrieval-Augmented Generation, from basic splitting to advanced semantic and agentic approaches.
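As a taste of the "basic splitting" end of that spectrum, a fixed-size chunker with overlap can be sketched in a few lines (chunk sizes and names here are illustrative assumptions, not the guide's code):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlapping windows,
    so content near a boundary still appears intact in at least one chunk."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```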