There's a line item in your LLM bill that nobody talks about openly: the cost of context stuffing. If your AI agents carry any kind of conversational history with customers, you're almost certainly feeding that history, raw and uncompressed, into every single prompt. And it's probably costing you 10× or more what it should.
This article breaks down the exact math and shows you a better architecture.
## The context stuffing trap
When developers first build agents that need to "remember" prior conversations, the instinct is straightforward: just include the chat history in the system prompt. It works. The agent reads the history, understands the context, and responds appropriately.
The problem is that this approach scales terribly — in both cost and quality.
A typical sales conversation runs 1,500–3,000 tokens. After five sessions with the same customer, you're stuffing 7,500–15,000 tokens of raw conversation history into every prompt before the agent can even say hello. After ten sessions, it's 15,000–30,000 tokens. And you pay for every single one of them, on every single call.
## The numbers, concretely
Let's do the math for a mid-size deployment running 1,000 customer interactions per day, using a reference price of $3 per million input tokens (in the ballpark of current GPT-4o-class input pricing).
| Scenario | Tokens / call | Daily cost | Annual cost |
|---|---|---|---|
| Context stuffing (5 prior sessions) | ~12,000 | $36 | ~$13,000 |
| Context stuffing (10 prior sessions) | ~24,000 | $72 | ~$26,000 |
| Compact memory profile (DeepRaven) | ~400 | $1.20 | ~$438 |
At 1,000 calls/day, switching from context stuffing to a compact memory profile saves roughly $12,700 to $25,800 per year, for a deployment that isn't even particularly large.
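The table above can be reproduced with a few lines of arithmetic. The $3/M price and per-call token counts are the reference figures from the table, not measured values:

```python
PRICE_PER_M = 3.00      # $ per million input tokens (reference price)
CALLS_PER_DAY = 1_000

def annual_cost(tokens_per_call: int) -> float:
    """Annual input-token spend for a fixed tokens-per-call budget."""
    daily = tokens_per_call * CALLS_PER_DAY * PRICE_PER_M / 1_000_000
    return daily * 365

stuffing_5 = annual_cost(12_000)   # 5 prior sessions stuffed in
stuffing_10 = annual_cost(24_000)  # 10 prior sessions stuffed in
profile = annual_cost(400)         # compact memory profile

print(f"Savings vs 5 sessions:  ${stuffing_5 - profile:,.0f}")
print(f"Savings vs 10 sessions: ${stuffing_10 - profile:,.0f}")
```

Running this gives savings of about $12,700 and $25,800 per year, matching the range above.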
## Why compact profiles outperform raw history
The cost argument alone is compelling, but there's a quality argument too. Raw conversation history is noisy. It contains pleasantries, filler, repeated information, and tangents. When you feed 15,000 tokens of raw transcript to a model, you're asking it to find the needle in the haystack on every call.
A well-structured customer profile contains only the signal:
- Budget range and constraints
- Buying triggers and timeline
- Known objections and how they've been addressed
- Personal details relevant to rapport
- Communication channel preferences
- Where they are in the buying journey
This is what your agent actually needs. Everything else is noise. A 400-token profile delivers higher-quality context than a 15,000-token transcript dump.
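As a sketch, a profile like this might be modeled as a small dataclass whose fields mirror the list above. The field names and the rendering format are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class CustomerProfile:
    # Field names are hypothetical; adapt to your own domain.
    budget_range: str = ""
    timeline: str = ""
    buying_triggers: list[str] = field(default_factory=list)
    objections: dict[str, str] = field(default_factory=dict)  # objection -> how it was addressed
    rapport_notes: list[str] = field(default_factory=list)
    preferred_channel: str = ""
    journey_stage: str = ""

    def to_prompt(self) -> str:
        """Render as a compact system-prompt block (a few hundred tokens)."""
        lines = [
            f"Budget: {self.budget_range}",
            f"Timeline: {self.timeline}",
            f"Stage: {self.journey_stage}",
            f"Channel: {self.preferred_channel}",
        ]
        lines += [f"Trigger: {t}" for t in self.buying_triggers]
        lines += [f"Objection: {k} -> {v}" for k, v in self.objections.items()]
        lines += [f"Rapport: {r}" for r in self.rapport_notes]
        return "\n".join(lines)
```

The point of the structure is that every field earns its tokens: nothing from the raw transcript survives unless it maps to one of these signal categories.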
## The architecture shift
The key insight is that memory and context are different problems:
- Context is the raw conversation happening right now. It belongs in the prompt.
- Memory is the distilled knowledge from all past conversations. It should be extracted, compressed, and stored — then injected as a compact profile.
The right architecture separates these two concerns. After each conversation, you pass the transcript to an extraction step that updates the customer profile. Before each new conversation, you fetch that profile and inject it as a compact system prompt addition. The current conversation then has full context with a fraction of the token overhead.
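A minimal sketch of that two-phase flow, with the LLM extraction call and the storage layer stubbed out. Both `extract_profile_update` and the in-memory `store` are illustrative placeholders, not a specific product API:

```python
store: dict[str, str] = {}  # customer_id -> compact profile text (stand-in for real storage)

def extract_profile_update(transcript: str, existing: str) -> str:
    # In practice: one LLM call that distills new facts from the transcript
    # and merges them into the existing profile. Stubbed for illustration.
    return (existing + "\n" if existing else "") + \
        f"(facts distilled from a {len(transcript)}-char transcript)"

def after_conversation(customer_id: str, transcript: str) -> None:
    """Phase 1: after each conversation, update the stored profile."""
    existing = store.get(customer_id, "")
    store[customer_id] = extract_profile_update(transcript, existing)

def build_system_prompt(customer_id: str, base_prompt: str) -> str:
    """Phase 2: before each conversation, inject the compact profile."""
    profile = store.get(customer_id, "")
    # The ~400-token profile goes into the prompt; the raw transcript never does.
    return f"{base_prompt}\n\n[Customer profile]\n{profile}" if profile else base_prompt
```

Note that extraction happens once per conversation, off the hot path, while the per-call cost is just the profile fetch and a few hundred tokens of prompt.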
## What this means at scale
The cost savings compound as your deployment grows. At 10,000 calls/day, a reasonable volume for a sales team with AI agents handling inbound, the 5-session scenario alone puts you at roughly $127,000/year saved versus naive context stuffing, and over $250,000/year in the 10-session scenario. That's before accounting for the quality improvements in agent responses.
This is the core of what DeepRaven is built to solve. Rather than building your own extraction pipeline, profile schema, and storage layer, you can plug in a two-endpoint API and get the token economics and quality benefits immediately.
The extraction is handled automatically. The profiles are maintained automatically. Your agents just fetch and go.