Context Windows, Memory, and Why AI Forgets: Designing Around Token Limits
Your user has been chatting with your AI assistant for 20 minutes, providing context about their project, their preferences, their constraints. They ask a follow-up question—and the AI responds as if the conversation just started. The user is confused, then frustrated, then gone. You just lost them, and the reason is a concept most PMs never think about: the context window.
What is a context window?
A context window is the maximum number of tokens an LLM can process in a single inference call. Think of it as the model's working memory—everything it can "see" at once. This includes the system prompt, the conversation history, any injected context (like retrieved documents), and the model's own response.
Here's a rough landscape as of early 2026:
- GPT-4o: 128K tokens (~96,000 words)
- Claude 3.5 Sonnet: 200K tokens (~150,000 words)
- Gemini 1.5 Pro: 1M+ tokens (~750,000 words)
- GPT-4o-mini: 128K tokens (~96,000 words)
These numbers look enormous. A 200K-token window holds roughly 150,000 words, approaching the length of Moby Dick. So why does AI still "forget"?
| Model | Window Size | Effective Range | Cost Impact |
|---|---|---|---|
| GPT-4o | 128K tokens | ~80K reliable | Moderate |
| Claude 3.5 Sonnet | 200K tokens | ~150K reliable | Higher |
| Gemini 1.5 Pro | 1M tokens | ~700K reliable | Variable |
| Llama 3.1 | 128K tokens | ~64K reliable | Self-hosted |
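The word counts above come from a rule of thumb: roughly four characters, or 0.75 English words, per token. A quick sketch of that heuristic (for exact counts, use the model's real tokenizer, such as OpenAI's tiktoken library):

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate using the common ~4-characters-per-token heuristic
    for English text. Not billing-accurate; use a real tokenizer for that."""
    return max(1, len(text) // 4)

def tokens_to_words(tokens: int) -> int:
    """Approximate English word count for a token budget (~0.75 words/token)."""
    return int(tokens * 0.75)

print(tokens_to_words(128_000))  # the ~96,000-word figure for a 128K window
```

The ratio drifts for code, non-English text, and dense technical prose, so treat these conversions as order-of-magnitude guides only.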
The dirty secret: Window size ≠ effective recall
Having a 200K token context window does not mean the model uses all 200K tokens equally well. Research consistently shows a "lost in the middle" effect: LLMs attend most strongly to information at the beginning and end of the context, and performance degrades for information buried in the middle.
In the study that gave the effect its name, Liu et al. (2023) found that accuracy on multi-document question-answering tasks could drop by roughly 20 percentage points when the relevant information was placed in the middle of a long context rather than at the beginning or end.
This means a 200K window doesn't give you 200K tokens of uniformly reliable context. Recall is strongest at the beginning and end, with a long stretch of degraded attention in the middle. For product design, effective recall, not the headline context window size, is the number that matters.
Why conversations "reset"
Most AI products manage context in one of two ways, and both have sharp edges:
Pattern 1: Full history in context
Every message in the conversation is included in each API call. This works beautifully for short conversations. But as the conversation grows, you hit two walls:
- Cost: You're re-sending the entire history with every message. A 50-turn conversation might be 20K tokens of history, and you're paying for it on every single call.
- Overflow: Eventually the history exceeds the context window. Now you have to truncate—and deciding what to truncate is a product decision with real consequences.
Pattern 2: Truncated history
Only the last N messages are included. This is cheap and simple, but the AI literally cannot remember what the user said 15 minutes ago. If the user set up a complex scenario in messages 1-5 and is now on message 20, that context is gone.
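A minimal sketch of Pattern 2, here bounded by a token budget rather than a fixed message count (the 4-chars-per-token counter is a crude stand-in for a real tokenizer):

```python
def truncate_history(messages: list[dict], budget: int,
                     count_tokens=lambda m: len(m["content"]) // 4) -> list[dict]:
    """Keep the most recent messages that fit within `budget` tokens.
    Walks backward from the newest message; anything older is dropped."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > budget:
            break  # everything older than this is silently forgotten
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

That `break` is exactly where the user's carefully constructed setup from messages 1-5 disappears.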
Designing around the limits: Four proven patterns
1. Sliding window with summarization
Instead of truncating old messages, summarize them. When the conversation exceeds a threshold (say, 8K tokens), use the LLM itself to generate a concise summary of the conversation so far. Insert this summary at the beginning of the context, then include only the most recent messages in full.
Trade-offs: You lose granular detail from early in the conversation, but you preserve the gist. The summary call adds latency and cost, but it's cheaper than including full history. This pattern works well for customer support bots and coding assistants where the overall intent matters more than exact phrasing from 10 minutes ago.
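A sketch of the threshold logic, with `summarizer` standing in for a real LLM summarization call (hypothetical; the 8K threshold, message counter, and keep-6 cutoff are all illustrative knobs):

```python
SUMMARY_THRESHOLD = 8_000  # tokens of history before we compress

def build_context(history, summarizer, keep_recent=6,
                  count=lambda m: len(m["content"]) // 4):
    """Sliding window with summarization: once the history exceeds the
    threshold, compress everything except the last `keep_recent` messages
    into a single summary message placed at the start of the context."""
    if len(history) <= keep_recent:
        return list(history)
    total = sum(count(m) for m in history)
    if total <= SUMMARY_THRESHOLD:
        return list(history)
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = summarizer(old)  # hypothetical LLM call: "summarize these messages"
    return [{"role": "system",
             "content": f"Summary of the conversation so far: {summary}"}] + recent
```

Note that the summary itself should be re-generated incrementally as the conversation grows, or it becomes a second context problem of its own.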
2. Retrieval-Augmented Generation (RAG)
Instead of stuffing everything into the context window, store information externally (in a vector database) and retrieve only what's relevant to the current query. The user's message is embedded, similar documents are retrieved, and those documents are injected into the prompt as context.
Trade-offs: Retrieval quality is only as good as your embeddings and chunking strategy. If you chunk a 100-page document into 500-token segments, the retrieval might pull a chunk that's missing critical surrounding context. RAG is excellent for grounding answers in specific documents, but it's not a drop-in replacement for conversational memory.
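To make the mechanics concrete, here is a toy retrieval loop. The bag-of-words `embed` is a deliberately crude stand-in for a real embedding model, and a production system would query a vector database rather than scanning a list:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding': a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query; these are what get
    injected into the prompt as context."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

chunks = [
    "Our refund policy covers orders returned within 30 days.",
    "Shipping times vary by region and carrier.",
    "The company was founded in 2019.",
]
top = retrieve("what is the refund policy for my order", chunks, k=1)
```

The chunking decision happens before any of this code runs, which is why a bad chunking strategy can't be fixed at retrieval time.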
3. Explicit memory stores
Some products (like ChatGPT's "Memory" feature) extract and persist key facts from conversations: "User prefers Python over JavaScript," "User works at a Series B fintech company." These facts are stored in a structured database and injected into the system prompt for future conversations.
Trade-offs: What gets remembered is a product decision with privacy implications. Users need transparency and control—the ability to see what the AI "remembers" and delete specific memories. The extraction itself can be unreliable; the model might infer preferences that aren't actually true.
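A minimal sketch of such a store, assuming the extraction step (turning raw conversation into facts) happens elsewhere; the important product surface here is that `forget` exists and is user-facing:

```python
class MemoryStore:
    """Minimal explicit memory store: persists extracted facts per user
    and renders them into a system-prompt preamble."""

    def __init__(self):
        self._facts: dict[str, list[str]] = {}

    def remember(self, user_id: str, fact: str) -> None:
        self._facts.setdefault(user_id, []).append(fact)

    def forget(self, user_id: str, fact: str) -> None:
        # User-facing deletion: transparency and control over what is stored.
        self._facts.get(user_id, []).remove(fact)

    def preamble(self, user_id: str) -> str:
        facts = self._facts.get(user_id, [])
        if not facts:
            return ""
        return "Known about this user:\n" + "\n".join(f"- {f}" for f in facts)
```

In production this would be backed by a real database and an audit trail, but the shape is the same: facts in, prompt preamble out, deletion on demand.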
4. Structured state management
For task-oriented AI features (e.g., a travel booking agent), don't rely on the conversation to carry state. Parse the conversation into a structured state object—departure city, destination, dates, preferences—and pass that state explicitly in each call. This decouples the state from the conversation length.
Trade-offs: Requires upfront engineering to define the state schema and extraction logic. But it's the most robust pattern for multi-step workflows. If the conversation is 50 turns but the booking state is 200 tokens, you've eliminated the context window problem entirely for the critical data.
The cost dimension: Context windows as a budget
Every token in the context window costs money. Here's a framework for thinking about context window allocation:
- System prompt: 500-2,000 tokens. This is fixed overhead for every call. Keep it lean.
- Retrieved context (RAG): 1,000-4,000 tokens. Enough to ground the response without flooding the window.
- Conversation history: 2,000-8,000 tokens. The sliding window with summarization.
- User's current message: 100-1,000 tokens. Variable, user-controlled.
- Model output budget: 500-4,000 tokens. What's left for the response.
Total: maybe 15,000-20,000 tokens per call, even though the model supports 128K. You could use more, but should you? At an illustrative $10 per million input tokens, going from 15K to 100K tokens per call is a roughly 6.7x cost increase. At scale, that's the difference between a viable product and one that burns cash.
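The budget math above, as a sketch (the $10-per-million price and the line items are illustrative; substitute your provider's current rates and your own allocation):

```python
PRICE_PER_TOKEN = 10 / 1_000_000  # assumed $10 per million input tokens

budget = {
    "system_prompt": 1_500,
    "retrieved_context": 3_000,
    "history": 6_000,
    "user_message": 500,
    "output_reserve": 4_000,  # completion tokens are priced separately
}

input_tokens = sum(v for k, v in budget.items() if k != "output_reserve")
cost_per_call = input_tokens * PRICE_PER_TOKEN
print(f"{input_tokens} input tokens -> ${cost_per_call:.3f} per call")
# At the same assumed price, a 100K-token context would cost ~$1.00 per call.
```

Multiply by calls per user per day and your user count, and the allocation table stops being an abstraction.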
Practical guidelines for PMs
- Don't advertise context window size as a feature. Users don't care that your model supports 128K tokens. They care that the AI remembers their project name from yesterday.
- Design memory as a first-class feature. If your product needs multi-session memory, build it explicitly. Don't rely on the context window to do double duty as a memory system.
- Instrument your token usage. Log the token count for every API call: system prompt, context, history, completion. You cannot optimize what you don't measure.
- Set expectations in the UX. If the AI will forget things, tell the user. "I can see the last 10 messages" is better than silently losing context and giving a confused response.
- Test at the boundaries. Don't just test with 3-turn conversations in staging. Test with 50-turn conversations. Test with 20-page documents injected as context. The failure modes only appear at the edges.
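For the instrumentation point, even a one-function structured log is enough to start; the per-component counts come from your provider's usage field or your own tokenizer (function shape is illustrative):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm.usage")

def log_usage(call_id: str, system: int, context: int, history: int,
              user: int, completion: int) -> int:
    """Structured per-call token log: the raw data for tuning the
    budget allocation described earlier."""
    total = system + context + history + user + completion
    log.info("call=%s system=%d context=%d history=%d user=%d "
             "completion=%d total=%d", call_id, system, context,
             history, user, completion, total)
    return total
```

Aggregating these logs tells you which component is eating the window, which is the question every optimization starts from.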
The future: Is infinite context coming?
Google's Gemini already offers 1M+ token context windows. Research into more efficient attention mechanisms (sparse attention, linear attention, ring attention) continues to push the boundaries. But infinite context doesn't mean infinite memory. Even with a 10M token window, the lost-in-the-middle problem persists, and the cost scales linearly at best.
The more likely future is hybrid architectures: large context windows for immediate recall, combined with persistent memory stores for long-term knowledge, and RAG for grounding in external data. The products that get this right—that make AI feel like it truly remembers—will win.
The memory patterns at a glance:
- RAG (retrieval): fetch relevant docs at query time. Best for large, changing knowledge bases.
- Summarization: compress conversation history into summaries. Best for long-running sessions.
- Sliding window: keep only the most recent N tokens. Simple but loses early context.
- Hybrid memory: combine approaches: recent window + summarized history + RAG for facts.