Ask most developers how their AI agent "remembers" things and they'll point at the context window. "It's all in the prompt," they'll say. And technically, they're right — everything the model knows during a conversation is in the context window. But that's not memory. That's working memory. And treating them as the same thing is one of the most expensive architectural mistakes in agent development.
What a context window actually is
The context window is the total number of tokens an LLM can process in a single call — input plus output. GPT-4o has a 128k token context window. Claude has 200k. Gemini has 1 million. These are impressive numbers, but they are fundamentally session-scoped.
When the API call ends, the context is gone. The model retains nothing. Start a new call, and you're back to zero. The context window is a whiteboard that gets erased after every meeting.
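Statelessness is easy to see in code. In the sketch below, a stub stands in for a real chat-completion API (an assumption on my part, not any specific vendor's client), but it shares the one property that matters: the model sees only what the client sends on each call.

```python
def call_model(messages):
    # Stub for a real chat-completion API call. Like the real thing,
    # it sees only what is in `messages` -- nothing carries over.
    return f"(reply based on {len(messages)} messages)"

# The client, not the model, is responsible for continuity.
history = []

def send(user_text):
    history.append({"role": "user", "content": user_text})
    reply = call_model(history)  # full history re-sent on every call
    history.append({"role": "assistant", "content": reply})
    return reply

send("My budget is $50k.")
send("What did I just say my budget was?")

# Drop the history and the "memory" is gone: a fresh call knows nothing.
fresh_reply = call_model([{"role": "user", "content": "What's my budget?"}])
```

The second `send` only "remembers" the budget because the client re-transmitted it; the whiteboard is rebuilt from scratch before every meeting.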
Context window = what the model can see right now, in a single call. Persistent memory = what the agent knows across all sessions, permanently.
Confusing the two leads builders toward a dead end: trying to use the context window as long-term storage. It seems to work at first. Then it breaks, costs a fortune, and degrades quality — usually all at once.
Why context windows make poor memory
Context window as memory: works until it doesn't
- ✗ Wiped clean at the end of every session
- ✗ Grows linearly with conversation count
- ✗ Every token costs money on every call
- ✗ Overflows for long-running relationships
- ✗ Forces the model to scan noise to find signal
- ✗ Identical cost whether a fact is used or not
Persistent memory: scales as you need
- ✓ Survives across sessions, channels, and agents
- ✓ Profile size stays fixed (~400 tokens)
- ✓ 37× cheaper context injection per call
- ✓ Handles unlimited history without overflow
- ✓ Pre-extracted signal, zero noise
- ✓ Continuously refined, never duplicated
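The cost gap is simple arithmetic. Using the article's own figures, a 15,000-token raw transcript versus a ~400-token profile, the ratio works out to about 37×. The price per token below is an assumed illustrative rate, not any vendor's actual pricing:

```python
PRICE_PER_1K_INPUT_TOKENS = 0.0025  # assumed rate in $; varies by model

def injection_cost(tokens_per_call, calls):
    """Dollar cost of re-injecting the same context on every call."""
    return tokens_per_call / 1000 * PRICE_PER_1K_INPUT_TOKENS * calls

# Re-sending a 15,000-token transcript vs a 400-token distilled profile,
# over 100 calls:
transcript_cost = injection_cost(15_000, calls=100)
profile_cost = injection_cost(400, calls=100)

ratio = transcript_cost / profile_cost
print(f"transcript: ${transcript_cost:.2f}, profile: ${profile_cost:.2f}, "
      f"{ratio:.1f}x cheaper")  # 37.5x
```

And because the transcript grows with every conversation while the profile stays fixed, the gap widens over the life of the relationship.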
The RAM vs hard drive analogy
If you've ever thought about computer memory, the analogy maps perfectly. A CPU's RAM is fast, limited, and volatile — it holds whatever the processor is working on right now. When the power goes off, it's gone. Your hard drive (or SSD) is slower to access but persistent — data survives power cycles and grows without practical limit.
The context window is RAM. Persistent memory is the hard drive. An agent that only uses RAM to "store" long-term information is like a computer with no disk — it works fine for a single session and loses everything the moment it restarts.
The right architecture uses both: load only what's needed from persistent storage into the context window at the start of each session, process the conversation in context, then write new learnings back to persistent storage when done.
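That loop can be sketched end to end. Everything here is a toy, a dict stands in for the persistent store and the processing step is stubbed, but the shape is the point: read before the session, write back after.

```python
# Toy persistent store ("disk"): in production, a real database.
store = {}

def load_profile(user_id):
    # Read from persistent storage into working memory. Return a copy
    # so the in-context view doesn't alias the store.
    return dict(store.get(user_id, {}))

def save_learnings(user_id, new_facts):
    # Write new learnings back to persistent storage after the session.
    store.setdefault(user_id, {}).update(new_facts)

def run_session(user_id, transcript):
    profile = load_profile(user_id)                           # disk -> RAM
    context = {"profile": profile, "transcript": transcript}  # the "RAM"
    # Processing is stubbed: pretend we learned one fact this session.
    save_learnings(user_id, {"last_topic": transcript[-1]})   # RAM -> disk
    return context

run_session("u1", ["Hi", "I need pricing for 50 seats."])
second = run_session("u1", ["Following up on pricing."])
```

The second session starts with what the first one learned, even though the model itself retained nothing between calls.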
The three memory tiers every agent needs
Thinking about memory in tiers helps clarify the design:
- In-context (working memory) — the current conversation, immediate instructions, tools available. Lives in the context window. Always fresh.
- Session cache — facts from earlier in the current session that need to stay available. Also in the context window, but can be summarized as the conversation grows.
- Persistent memory (long-term) — everything learned across all past sessions. Lives outside the model, in a dedicated store. Injected as a compact profile at session start.
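The three tiers map naturally onto a small data structure. A minimal sketch (the class and field names are mine, not a standard API):

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    # Tier 1: in-context working memory -- rebuilt fresh each session.
    in_context: list = field(default_factory=list)
    # Tier 2: session cache -- summarized as the conversation grows.
    session_cache: list = field(default_factory=list)
    # Tier 3: persistent profile -- lives outside the model, survives.
    persistent: dict = field(default_factory=dict)

    def end_session(self):
        # Tiers 1 and 2 are volatile; only tier 3 outlives the session.
        self.in_context.clear()
        self.session_cache.clear()

mem = AgentMemory()
mem.in_context.append("user: what's your pricing?")
mem.session_cache.append("user asked about pricing earlier")
mem.persistent["budget"] = "$50k"
mem.end_session()
```

After `end_session`, only `persistent` still holds anything, which is exactly the property the first two tiers lack.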
Most agent frameworks handle the first two reasonably well. The third tier — persistent, cross-session memory — is what's almost universally missing, and it's where most of the value lies.
The right architecture: extract, store, inject
Building persistent memory correctly requires three steps, which should run as distinct processes:
- Extract: when a session ends, distill the conversation into structured facts.
- Store: merge those facts into a persistent profile that lives outside the model, refining existing entries rather than duplicating them.
- Inject: at the start of the next session, load the compact profile into the context window.
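A minimal sketch of that pipeline. A keyword scan stands in for the LLM-driven extraction a real system would use (that substitution is my assumption; only the pipeline shape is from the article):

```python
def extract(transcript):
    # Placeholder extractor: a real system would use an LLM call here.
    facts = {}
    for line in transcript:
        if "budget" in line.lower():
            facts["budget"] = line
        if "timeline" in line.lower():
            facts["timeline"] = line
    return facts

def store(profiles, user_id, facts):
    # Merge rather than append: facts get refined, never duplicated.
    profiles.setdefault(user_id, {}).update(facts)

def inject(profiles, user_id):
    # Render the profile as a compact block for the next system prompt.
    profile = profiles.get(user_id, {})
    return "\n".join(f"{k}: {v}" for k, v in sorted(profile.items()))

profiles = {}
store(profiles, "u1", extract(["Our budget is $50k.", "Timeline: Q3."]))
print(inject(profiles, "u1"))
```

Running extraction and storage as background processes, separate from the live conversation, keeps latency out of the user-facing path.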
The quality side of the argument
It's tempting to focus only on cost when making the case for persistent memory. But the quality argument is just as strong. A model reading 15,000 tokens of raw transcript must infer what's important, and it often gets this wrong: important facts are buried under small talk, and recency bias means the latest statements crowd out older, still-valid context.
A well-structured profile presents only the distilled signal: budget, objections, timeline, relationship context. The model doesn't have to search for what matters; it's already organized for fast consumption. Agents respond better, and they do it from the first turn, every time.
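Concretely, such a profile can be as simple as a small structured object injected into the system prompt. The field names follow the categories named above, but the schema and values here are purely illustrative, not DeepRaven's actual format:

```python
import json

# Illustrative distilled profile; ~400 tokens in practice, far less here.
profile = {
    "budget": "$50k approved for this fiscal year",
    "objections": ["worried about migration effort"],
    "timeline": "decision expected by end of Q3",
    "relationship": "second renewal cycle; main contact is the ops lead",
}

system_prompt = (
    "Known facts about this customer:\n" + json.dumps(profile, indent=2)
)
print(system_prompt)
```

The model reads a few hundred tokens of pre-organized fact instead of thousands of tokens of transcript, which is where both the cost and the quality gains come from.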
This is what DeepRaven implements. The extraction, storage, and retrieval are handled for you — you just send conversations and fetch profiles. The architecture described in this article is what runs under the hood.