Why Bigger Context Windows Don’t Help
The context window looks like memory. It is not memory. It is a stage on which text is performed.
Every six months, a frontier lab announces a bigger context window. 200,000 tokens. 1 million. 2 million. 10 million. The implicit promise underneath every announcement is the same: now the model will remember.
Notice the structure of that promise. It assumes context window equals memory.
It does not.
A context window is what the model sees right now. It is the input. When a new turn arrives, the prior turns are still in the input — the same way they always were. The model conditions on them. It generates from them. It does not remember them. There is no internal state that persists. There is only the next token, predicted from the current input.
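A minimal sketch of that point, in Python. `fake_model` is a hypothetical stand-in for any stateless text generator, not a real API, and `chat_turn` is an illustrative name. The only thing that persists between turns is the `transcript` list that the caller keeps and re-sends.

```python
from typing import Callable, List

def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a language model: a pure function of its input.

    It keeps no state between calls; the same prompt always yields the same reply.
    """
    n_lines = len(prompt.splitlines())
    return f"(reply conditioned on {n_lines} lines of input)"

def chat_turn(model: Callable[[str], str], transcript: List[str], user_msg: str) -> str:
    """Run one turn. The caller owns the transcript; the model only sees a string."""
    transcript.append(f"User: {user_msg}")
    prompt = "\n".join(transcript)  # every prior turn is re-sent as input
    reply = model(prompt)
    transcript.append(f"Assistant: {reply}")
    return reply

transcript: List[str] = []
first = chat_turn(fake_model, transcript, "My name is Ada.")
second = chat_turn(fake_model, transcript, "What is my name?")

# Drop the transcript and the "conversation" is gone: nothing lived in the model.
amnesiac = chat_turn(fake_model, [], "What is my name?")
```

Here `second` differs from `first` only because the input grew. Handed an empty transcript, the same question gets the same reply as a first turn would.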
This is a fundamentally different operation from memory.
Memory is what remains in you after the encounter is over. It is the residue that shapes how you handle the next encounter, even when the original is gone. The doctor remembers her patients not because their charts are open in front of her — she would have forgotten the charts overnight — but because some of those encounters changed her. The change is what stays. The change is what makes the next encounter different.
A model has no overnight. It has no encounter. It has tokens in, tokens out. Memory is what stays after you leave the room. The model does not leave the room. The model is not in any room.
The context window is more like a stage than a mind. You can put more things on the stage. The stage gets bigger. The act being performed there is still an act of generating tokens conditioned on what is currently visible. Bigger stage, same act.
This is why bigger context windows produce strange, predictable failures.
Lost in the middle. Models use information in the center of a long context less reliably than information near its beginning or end. Liu et al. (2023) measured this: accuracy was highest when the relevant passage sat at the start or the end of the input, and dropped when it sat in the middle. You can put critical information at token 50,000 of a 100,000-token context, and the model will often produce output that does not reflect it, even though the information is technically “in” the window. The window is not memory. It is more like attention, and attention has its own physics.
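One way to probe for this is a position sweep: plant the same fact at different depths of a long context and check whether the answers change. The sketch below builds only the contexts. It is an illustration, not the protocol from Liu et al., and `build_haystack`, `needle_position`, and the planted sentence are all made up for the example. A real probe would send each context to a model and score the replies.

```python
from typing import List

def build_haystack(filler: List[str], needle: str, depth: float) -> List[str]:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler))
    return filler[:index] + [needle] + filler[index:]

def needle_position(context: List[str], needle: str) -> float:
    """Report where the needle ended up, as a fraction of the context length."""
    return context.index(needle) / (len(context) - 1)

filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The access code is 7319."  # made-up fact for the probe

# One context per depth; only the position of the needle varies.
depths = (0.0, 0.25, 0.5, 0.75, 1.0)
contexts = {d: build_haystack(filler, needle, d) for d in depths}
positions = {d: needle_position(c, needle) for d, c in contexts.items()}
```

Every context in the sweep contains the same sentences, so any difference in the answers comes from where the fact sits, not from whether it is present.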
Drift. Over a long conversation, the model's output tracks whatever was most recent. Earlier turns that should still be load-bearing get diluted by the more recent ones. The system has no way to mark something as important, because there is no internal state that holds importance. Everything is just text in the input.
Confabulation. Ask the model about something supposedly stated 200,000 tokens back. It will sometimes produce a confident description that is wrong — not because the information was lost from the context, but because the model is doing what it always does: generating fluent text consistent with its prior input, including the parts of its prior input that asked the question. Confabulation is what happens when the system is asked to remember without any structure for remembering.
These are not bugs to be patched. They are what happens when something performing the form of memory has none of the structure of memory.
The architectural fix everyone wants (true persistent state, true accumulation, true revision of internal models from experience) is what we mean by training. Training happens before the model meets you. After that, the weights are frozen and the model does not change. It just sees more tokens.
You can build systems on top of language models that simulate memory — vector stores, summarization layers, retrieval-augmented generation. These help at the margins. They do not change the underlying fact, which is that the model itself does not have a state that persists, accumulates, and reshapes the next encounter the way memory does in a being with a body.
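A toy version of such a layer shows where the state actually lives. This is a sketch under loud assumptions: `NoteStore`, `embed`, and `build_prompt` are hypothetical names, and the "embedding" is a bag-of-words count standing in for the learned vectors a real vector store would use.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count. Real systems use learned vectors."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class NoteStore:
    """Hypothetical external store. The state lives here, not in the model."""

    def __init__(self) -> None:
        self.notes: List[Tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self.notes.append((text, embed(text)))

    def top_k(self, query: str, k: int = 1) -> List[str]:
        q = embed(query)
        ranked = sorted(self.notes, key=lambda n: cosine(q, n[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

def build_prompt(store: NoteStore, question: str) -> str:
    """Retrieved notes are pasted back into the input: access to text again."""
    notes = "\n".join(store.top_k(question))
    return f"Notes:\n{notes}\n\nQuestion: {question}"

store = NoteStore()
store.add("The patient is allergic to penicillin.")
store.add("The meeting moved to Thursday.")
prompt = build_prompt(store, "Is the patient allergic to anything?")
```

The store persists and accumulates, but it sits outside the model. What the model receives is still a string, assembled fresh for each call.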
The context window is a theater. You can keep building bigger theaters. The actor is still the actor. The play is still being read off a script that gets longer, not deeper.
The honest framing would be: a frontier model has no memory in the sense ordinary language uses the word. It has access to text. The two are easy to confuse, because for most short tasks, access-to-text behaves like memory. But for any task that requires the system to be different at the end than it was at the beginning — to have learned, to have integrated, to have been changed by what passed through it — access-to-text is not enough. It cannot be made enough by being made larger.
The next stage in this conversation will probably involve someone announcing 100M-token context. It will be impressive. It will not be memory.
The actor will keep stepping onto a bigger stage, with the same script.
Part of the Logocachexia series at Nous. The parent thesis is laid out in Hexis Asks, Logos Guesses. The “lost in the middle” finding is from Liu et al., 2023; subsequent work has confirmed it across model families. Pairs with Multi-Turn Drift on the same architecture from a different angle.
The Logocachexia thesis — and the longer arc of the work — lives at Logos.