Hello! Hope you’re doing well.

This blog post will be a bit unusual. I'm here to bust some myths surrounding Llama 4. Don't worry if you're unfamiliar with the context or these models: I'll explain clearly as we go along.

Context

https://x.com/AIatMeta/status/1908598456144531660

Let’s go straight to the claims and break them down with some logic, analogies and maths (where required).


CLAIM 1: The 10M Context!

<aside>

The 10M (10 million) token context isn’t real: no model was trained on prompts longer than 256k tokens, and beyond 256k, output quality drops.

</aside>

Breakdown:

What is a context window?

In LLMs, the context window is the amount of text (measured in tokens—think words or word pieces) the model can "see" at once to generate a response.
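To make "tokens" concrete, here is a minimal sketch of counting them with the Hugging Face `transformers` library. The model name is purely illustrative (it is a gated repo, and any Llama-family tokenizer behaves the same way), and the 256k limit is the training length discussed below, not an official constant:

```python
from transformers import AutoTokenizer

# Illustrative tokenizer; swap in any Llama-family tokenizer you have access to.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

TRAINED_CONTEXT = 256_000  # longest sequence length reportedly used in training

prompt = "Summarize the following document: ..." * 10_000

# Count how many tokens the prompt occupies (roughly 3-4 English characters per token).
token_ids = tokenizer.encode(prompt)
print(f"Prompt length: {len(token_ids)} tokens")

if len(token_ids) > TRAINED_CONTEXT:
    print("Prompt is longer than anything the model saw in training; expect quality to drop.")
```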

A 10M token context sounds impressive. BUT!

The model wasn’t trained on sequences longer than 256k tokens. Training is where the model learns patterns from data. If it’s never seen prompts longer than 256k, it’s like asking a chef to cook a 10-course meal when they’ve only practiced single dishes. Beyond 256k, it’s guessing, not reasoning.

Let’s dive into some of the maths behind this.
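A quick back-of-the-envelope check, using only the two numbers from the claim above (10M advertised, 256k trained, with 256k taken at face value as 256,000):

```python
claimed_context = 10_000_000   # the advertised 10M-token window
trained_context = 256_000      # the longest sequences seen during training

# How far beyond its training data the model must extrapolate:
ratio = claimed_context / trained_context
print(f"Extrapolation factor: ~{ratio:.0f}x")          # ~39x

# Put differently: the share of the advertised window that lies
# outside anything the model saw during training.
unseen_fraction = 1 - trained_context / claimed_context
print(f"Unseen fraction of the window: {unseen_fraction:.1%}")  # ~97.4%
```

So roughly 97% of that 10M window is territory the model has never been trained on.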

<aside>

Let’s make it clearer with an analogy.

Imagine a painter who has only worked on 8x10 canvases being asked to paint a mural the size of a football field. He might stretch his skills, but the details get blurry. That’s Llama 4 beyond 256k tokens.